
On the optimality of solutions of the max-product belief propagation in arbitrary graphs.

Yair Weiss and William T. Freeman

Abstract

Graphical models, such as Bayesian networks and Markov random fields, represent statistical dependencies of variables by a graph. The max-product "belief propagation" algorithm is a local message-passing algorithm on this graph that is known to converge to a unique fixed point when the graph is a tree. Furthermore, when the graph is a tree, the assignment based on the fixed point yields the most probable a posteriori (MAP) values of the unobserved variables given the observed ones.

Recently, good empirical performance has been obtained by running the max-product algorithm (or the equivalent min-sum algorithm) on graphs with loops, for applications including the decoding of "turbo" codes. Except for two simple graphs (cycle codes and single loop graphs) there has been little theoretical understanding of the max-product algorithm on graphs with loops. Here we prove a result on the fixed points of max-product on a graph with arbitrary topology and with arbitrary probability distributions (discrete- or continuous-valued nodes). We show that the assignment based on a fixed point is a "neighborhood maximum" of the posterior probability: the posterior probability of the max-product assignment is guaranteed to be greater than that of all other assignments in a particular large region around that assignment. The region includes all assignments that differ from the max-product assignment in any subset of nodes that form no more than a single loop in the graph. In some graphs this neighborhood is exponentially large. We illustrate the analysis with examples.

keywords: Belief propagation, max-product, min-sum, Bayesian networks, Markov random fields, MAP estimate.

Problems involving probabilistic belief propagation arise in a wide variety of applications, including error correcting codes, speech recognition and image understanding. Typically, a probability distribution is assumed over a set of variables and the task is to infer the values of the unobserved variables given the observed ones. The assumed probability distribution is described using a graphical model [14]; the qualitative aspects of the distribution are specified by a graph structure. The graph may either be directed, as in a Bayesian network [18], [12], or undirected, as in a Markov Random Field [18], [10].

Here we focus on the problem of finding an assignment for the unobserved variables that is most probable given the observed ones. In general, this problem is NP-hard [19], but if the graph is singly connected (i.e. there is only one path between any two given nodes) then there exist efficient local message-passing schemes to perform this task. Pearl [18] derived such a scheme for singly connected Bayesian networks. The algorithm, which he called "belief revision", is identical to his algorithm for finding posterior marginals over nodes except that the summation operator is replaced with a maximization. Aji et al. [2] have shown that both of Pearl's algorithms can be seen as special cases of generalized distributive laws over particular semirings. In particular, Pearl's algorithm for finding maximum a posteriori (MAP) assignments can be seen as a generalized distributive law over the max-product semiring. We will henceforth refer to it as the "max-product" algorithm.

Pearl showed that for singly connected networks, the max-product algorithm is guaranteed to converge and that the assignment based on the messages at convergence is guaranteed to give the optimal assignment, the values corresponding to the MAP solution.

Several groups have recently reported excellent experimental results by running the max-product algorithm on graphs with loops [23], [6], [3], [20], [11]. Benedetto et al. used the max-product algorithm to decode "turbo" codes and obtained excellent results that were slightly inferior to the original turbo decoding algorithm (which is equivalent to the sum-product algorithm). Weiss [20] compared the performance of sum-product and max-product on a "toy" turbo code problem while distinguishing between converged and unconverged cases. He found that if one considers only the convergent cases, the performance of max-product decoding is significantly better than sum-product decoding. However, the max-product algorithm converges less often, so its overall performance (including both convergent and nonconvergent cases) is inferior.

Progress in the analysis of the max-product algorithm has been made for two special topologies: single loop graphs, and "cycle codes". For graphs with a single loop [23], [20], [21], [5], [2], it can be shown that the algorithm converges to a stable fixed point or a periodic oscillation. If it converges to a stable fixed point, then the assignment based on the fixed-point messages is the optimal assignment. For graphs that correspond to cycle codes (low density parity check codes in which each bit is checked by exactly two check nodes), Wiberg [23] gave sufficient conditions for max-product to converge to the transmitted codeword and Horn [11] gave sufficient conditions for convergence to the MAP assignment.

In this paper we analyze the max-product algorithm in graphs of arbitrary topology. We show that at a fixed point of the algorithm, the assignment is a "neighborhood maximum" of the posterior probability: the posterior probability of the max-product assignment is guaranteed to be greater than that of all other assignments in a particular large region around that assignment. These results motivate using this powerful algorithm in a broader class of networks.

Y. Weiss is with the Computer Science Division, 485 Soda Hall, UC Berkeley, Berkeley, CA 94720-1776. E-mail: [email protected]. Support by MURI-ARO-DAAH04-96-1-0341, MURI N00014-00-1-0637 and NSF IIS-9988642 is gratefully acknowledged.

W. T. Freeman is with MERL, Mitsubishi Electric Research Labs., 201 Broadway, Cambridge, MA 02139. E-mail: [email protected].


Fig. 1. Any Bayesian network can be converted into an undirected graph with pairwise cliques by adding cluster nodes for all parents that share a common child. (a) A Bayesian network. (b) The corresponding undirected graph with pairwise cliques. A cluster node for (x_2, x_3) has been added. The potentials can be set so that the joint probability in the undirected graph is identical to that in the Bayesian network. In this case the update rules presented in this paper reduce to Pearl's propagation rules in the original Bayesian network [21].

I. The max-product algorithm in pairwise Markov Random Fields

Pearl's original algorithm was described for directed graphs, but in this paper we focus on undirected graphs. Every directed graphical model can be transformed into an undirected graphical model before doing inference (see figure 1). An undirected graphical model (or a Markov random field) is a graph in which the nodes represent variables and arcs represent compatibility relations between them. Assuming all probabilities are nonzero, the Hammersley-Clifford theorem (e.g. [18]) guarantees that the probability distribution will factorize into a product of functions of the maximal cliques of the graph.

Denoting by x the values of all unobserved variables in the graph, the factorization has the form:

P(x) = \prod_c \Psi_c(x_c)    (1)

where x_c is a subset of x that forms a clique in the graph and \Psi_c is the potential function for the clique.

We will assume, without loss of generality, that each x_i node has a corresponding y_i node that is connected only to x_i. Thus:

P(x, y) = \prod_c \Psi_c(x_c) \prod_i \Psi_{ii}(x_i, y_i)    (2)

The restriction that all the y_i variables are observed and none of the x_i variables are is just to make the notation simple: \Psi_{ii}(x_i, y_i) may be independent of y_i (equivalent to y_i being unobserved), or \Psi_{ii}(x_i, y_i) may be \delta(x_i - x_o) (equivalent to x_i being observed, with value x_o).

In describing and analyzing belief propagation, we assume the graphical model has been preprocessed so that all the cliques consist of pairs of units. Any graphical model can be converted into this form before doing inference through a suitable clustering of nodes into large nodes [21]. Figure 1 shows an example: a Bayesian network is converted into an MRF in which all the cliques are pairs of units.

Equation 2 becomes:

P(x, y) = \prod_{(i,j)} \Psi_{ij}(x_i, x_j) \prod_i \Psi_{ii}(x_i, y_i)    (3)

where the first product is over connected pairs of nodes.
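To make the pairwise notation concrete, here is a minimal Python sketch of the factorization in equation 3 on a toy two-node chain; the potential tables and all numbers in it are our own illustrative choices, not values from the paper.

```python
import numpy as np

# A toy instance of equation 3: a two-node chain x1 - x2, each binary,
# each with one observed neighbor. All tables below are invented values.
psi_12 = np.array([[0.9, 0.1],
                   [0.1, 0.9]])          # Psi(x1, x2): favors agreement
psi_obs = {1: np.array([0.8, 0.2]),      # Psi(x1, y1) for the observed y1
           2: np.array([0.3, 0.7])}      # Psi(x2, y2) for the observed y2

def joint(x1, x2):
    """Unnormalized P(x, y) = Psi(x1, x2) Psi(x1, y1) Psi(x2, y2)."""
    return psi_12[x1, x2] * psi_obs[1][x1] * psi_obs[2][x2]

# On a graph this small the MAP assignment can be found by enumeration:
print(max(((x1, x2) for x1 in (0, 1) for x2 in (0, 1)), key=lambda a: joint(*a)))
```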

The important property of MRFs that we will use throughout this paper is the Markov blanket property. The probability of a subset of nodes S given all other nodes in the graph S^C depends only on the values of the nodes that immediately neighbor S. Furthermore, the probability of S given all other nodes is proportional to the product of all clique potentials within S and all clique potentials between S and its immediate neighbors:

P(x_S | x_{S^C}) = P(x_S | N(x_S))    (4)

P(x_S | N(x_S)) \propto \prod_{(x_i, x_j) \in S} \Psi_{ij}(x_i, x_j) \prod_{x_i \in S, x_j \in N(S)} \Psi_{ij}(x_i, x_j)    (5)

The advantage of preprocessing the graph into one with pairwise cliques is that the description and the analysis of belief propagation become simpler. For completeness, we review the belief propagation scheme used in [21]. As we discuss in the appendix, this belief propagation scheme is equivalent to Pearl's belief propagation algorithm in directed graphs, the Generalized Distributive Law algorithm of [1] and the factor graph propagation algorithm of [13]. These three algorithms correspond to the algorithm presented here with a particular way of preprocessing the graph in order to obtain pairwise potentials.

At every iteration, each node sends a (different) message to each of its neighbors and receives a message from each neighbor. Let x_i and x_j be two neighboring nodes in the graph. We denote by m_{ij}(x_j) the message that node x_i sends to node x_j, by m_{ii}(x_i) the message that y_i sends to x_i, and by b(x_i) the belief at node x_i.

The max-product update rules are:

m_{ij}(x_j) \leftarrow \alpha \max_{x_i} \Psi_{ij}(x_i, x_j) m_{ii}(x_i) \prod_{x_k \in N(x_i) \setminus x_j} m_{ki}(x_i)    (6)

b(x_i) \leftarrow \alpha m_{ii}(x_i) \prod_{x_k \in N(x_i)} m_{ki}(x_i)    (7)

where \alpha denotes a normalization constant and N(x_i) \setminus x_j means all nodes neighboring x_i, except x_j.

The procedure is initialized with all message vectors set to constant functions. Observed nodes do not receive messages and they always transmit the same vector: if y_i is observed to have value y_i^* then m_{ii}(x_i) = \Psi_{ii}(x_i, y_i^*). The normalization of m_{ij} in equation 6 is not necessary; whether or not the messages are normalized, the beliefs b_i will be identical. However, normalizing the messages avoids numerical underflow and adds to the stability of the algorithm. We assume throughout this paper that all nodes simultaneously update their messages in parallel.
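As a concrete rendering of equations 6 and 7, the following Python sketch implements the parallel updates; it is our own illustration (not the authors' code), and the `psi_pair`/`psi_obs` dictionary layout is an assumed encoding of the pairwise and observation potentials.

```python
import numpy as np

def max_product(psi_pair, psi_obs, n_states, n_iter=100):
    """Parallel max-product updates of equations 6-7 (a sketch).

    psi_pair: dict mapping each directed edge (i, j) to an
              (n_states x n_states) array with entries Psi_ij[x_i, x_j];
              both (i, j) and (j, i) are present, with
              psi_pair[(j, i)] equal to psi_pair[(i, j)].T.
    psi_obs:  dict mapping node i to the vector m_ii(x_i) = Psi(x_i, y_i)
              sent by the observed node y_i.
    """
    nodes = list(psi_obs)
    nbrs = {i: [j for (a, j) in psi_pair if a == i] for i in nodes}
    # initialize all messages to constant functions
    m = {edge: np.ones(n_states) / n_states for edge in psi_pair}
    for _ in range(n_iter):
        new_m = {}
        for (i, j) in psi_pair:
            # m_ii(x_i) times messages from all neighbors of i except j
            prod = psi_obs[i].copy()
            for k in nbrs[i]:
                if k != j:
                    prod *= m[(k, i)]
            # eq. 6: m_ij(x_j) = alpha * max_{x_i} Psi_ij(x_i, x_j) * prod(x_i)
            msg = np.max(psi_pair[(i, j)] * prod[:, None], axis=0)
            new_m[(i, j)] = msg / msg.sum()   # normalize to avoid underflow
        m = new_m                             # simultaneous, parallel update
    # eq. 7: b(x_i) = alpha * m_ii(x_i) * prod_k m_ki(x_i)
    beliefs = {}
    for i in nodes:
        b = psi_obs[i].copy()
        for k in nbrs[i]:
            b *= m[(k, i)]
        beliefs[i] = b / b.sum()
    return beliefs, {i: int(np.argmax(beliefs[i])) for i in nodes}
```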

For singly connected graphs it is easy to show that:
• The algorithm converges to a unique fixed point, regardless of initial conditions, in a finite number of iterations.
• At convergence, the belief for any value x_i of a node i is the maximum of the posterior, conditioned on that node having the value x_i: b(x_i) = \max_x P(x | y, x_i).
• Define the max-product assignment, x^*, by x_i^* = \arg\max_{x_i} b(x_i) (assuming a unique maximizing value exists). Then x^* is the MAP assignment.

The max-product assignment assumes there are no "ties", that is, a unique maximizing x_i exists for all b(x_i). Ties can arise when the MAP assignment is not unique, e.g. when there are two assignments that have identical posterior and both maximize the posterior. For singly connected graphs, the converse is also true: if there are no ties in b(x_i) then the MAP assignment is unique. In what follows, we assume a unique MAP assignment.

In particular applications, it might be easier to work in the log domain, so that the product operations in equations 6-7 are replaced by sum operations. Thus the max-product algorithm is sometimes referred to as the max-sum algorithm or the min-sum algorithm [23], [5]. If the graph is a chain, max-product is a two-way version of the Viterbi algorithm in Hidden Markov Models and is closely related to concurrent dynamic programming [4]. Despite this connection to well studied algorithms, there has been very little analytical success in characterizing the solutions of the max-product algorithm on arbitrary graphs with loops.

II. What are the fixed points of the max-product algorithm?

Each iteration of the max-product algorithm can be thought of as an operator F that inputs a list of messages m^t and outputs a list of messages m^{t+1} = F(m^t). Thus running belief propagation can be thought of as an iterative way of finding a solution to the fixed point equation F(m) = m with an initial guess m^0 in which all messages are constant functions.

Note that this is not the only way of finding fixed points. Murphy et al. [17] describe an alternative method for finding fixed points of F. They suggested iterating:

m^{t+1} = \lambda F(m^t) + (1 - \lambda) m^{t-1}    (8)
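A minimal sketch of this damped iteration, treating F as a black-box map on messages stacked into a single NumPy vector (the function name, the bootstrap step, and the default lam = 0.5 are our own choices):

```python
import numpy as np

def damped_iteration(F, m0, lam=0.5, tol=1e-5, max_iter=100):
    """Iterate m[t+1] = lam*F(m[t]) + (1-lam)*m[t-1], as in equation 8.

    F is any map from a stacked message vector to a stacked message
    vector. Returns the final messages and a convergence flag.
    """
    m_prev, m_curr = m0, F(m0)  # bootstrap: one plain update supplies m[1]
    for _ in range(max_iter):
        m_next = lam * F(m_curr) + (1.0 - lam) * m_prev
        if np.max(np.abs(m_next - m_curr)) < tol:
            return m_next, True   # m satisfies m = F(m) up to tolerance
        m_prev, m_curr = m_curr, m_next
    return m_curr, False
```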

Fig. 2. (a) A 25x25 grid of points. (b)-(c) Examples of subsets of nodes that form no more than a single loop. In this paper we prove that changing such a subset of nodes from the max-product assignment will always decrease the posterior probability.

Here again, if the iterations of equation 8 converge to m^*, then m^* satisfies m^* = F(m^*).

The equation m^* = F(m^*) is a highly nonlinear equation and it is not obvious how many solutions exist or how to characterize them. Horn [11] showed that in single-loop graphs, F can be considered as matrix multiplication over the max-product semiring, and fixed points correspond to eigenvectors of that matrix. Thus even in single loop graphs, one can construct examples with any number of fixed points.

The main result of this paper is a characterization of how well the max-product assignment approximates the MAP assignment. We show that the assignment x^* must be a neighborhood maximum of P(x|y): that is, P(x^*|y) > P(x|y) for all x in a particular large region around x^*. This condition is weaker than a global maximum but stronger than a local maximum.



To be more precise we de ne the Single Loops and Trees SLT neighborhood of an assignment x in a graphical



mo del G to include all assignments x that can b e obtained from x by:

 Cho osing an arbitrary subset S of no des in G that consists of disconnected combinations of trees and single lo ops.

 Assigning arbitrary values to x | the chosen subset of no des. The other no des have the same assignmentasin

S



x .



Claim 1: For an arbitrary graphical model with arbitrary potentials, if m^* is a fixed point of the max-product algorithm and x^* is the assignment based on m^*, then P(x^*|y) > P(x|y) for all x \neq x^* in the SLT neighborhood of x^*.

Figure 2 illustrates example configurations within the SLT neighborhood of the max-product assignment. It shows examples of subsets of nodes that could be changed to arbitrary values; the posterior probability of the resulting assignment is guaranteed to be worse than that of the max-product assignment.

To build intuition, we first describe the proof for a specific case, the diamond graph of figure 3. The general proof is given in section II-B.

A. Specific Example

We start by giving an overview of the proof for the diamond graph shown in figure 3a. The proof is based on the unwrapped tree, the graphical model that loopy belief propagation is solving exactly when applying the belief propagation rules in a loopy network [9], [23], [21], [22]. In error-correcting codes, the unwrapped tree is referred to as the "computation tree": it is based on the idea that the computation of a message sent by a node at time t depends on messages it received from its neighbors at time t-1, and those messages depend on the messages the neighbors received at time t-2, etc.

Figure 3 shows an unwrapped tree around node x_1 for the diamond shaped graph on the left. Each node has a (shaded) observed node attached to it that is not shown for simplicity.

To simplify notation, we assume that x^*, the assignment based on a fixed point of the max-product algorithm, is equal to zero, x^* = 0. The periodic assignment lemma from [22] guarantees that we can modify \tilde\Psi_{ii}(x_i, y_i) for the leaf nodes so that the optimal assignment in the unwrapped tree is all zeros:

P(\tilde{x} = 0 | \tilde{y}) = \max_{\tilde{x}} P(\tilde{x} | \tilde{y})    (9)

The \tilde\Psi_{ii}(x_i, y_i) are modified to include the messages from the nodes to be added at the next stage of the unwrapping.


Fig. 3. (a) A Markov Random Field with multiple loops. (b) The unwrapped graph corresponding to this structure. The unwrapped graphs are constructed by replicating the potentials \Psi(x_i, x_j) and observations y_i while preserving the local connectivity of the loopy graph. They are constructed so that the messages received by node x_1 after t iterations in the loopy graph are equivalent to those that would be received by x_1 in the unwrapped graph. An observed node, y_i, not shown, is connected to each depicted node.

We now show that the global optimality of \tilde{x} = 0 and the method of construction of the unwrapped tree guarantee that P(x = 0 | y) > P(x | y) for all x in the SLT neighborhood of 0.

Referring to figure 3a, suppose that P(x = 10000 | y) > P(x = 00000 | y). By the Markov property of the diamond figure, this means that:

P(x_1 = 1 | x_{2..4} = 0) > P(x_1 = 0 | x_{2..4} = 0)    (10)

Note that node \tilde{x}_1 has exactly the same neighbors in the unwrapped graph as x_1 has in the loopy graph. Furthermore, by the method of construction, the potentials between x_1 and each of its neighbors are the same as the potentials between \tilde{x}_1 and its neighbors. Thus equation 10 implies that:

P(\tilde{x}_1 = 1 | \tilde{x}_{2..4} = 0) > P(\tilde{x}_1 = 0 | \tilde{x}_{2..4} = 0)    (11)

in contradiction to equation 9. Thus no change of a single x_i can improve the posterior probability.

What about changing two x_i at a time? If we change a pair that is not connected in the graph, say x_1 and x_5, then by the Markov property this is equivalent to changing one at a time. Thus suppose P(x = 10001 | y) > P(x = 00000 | y); this again implies that P(x_1 = 1 | x_{2..4} = 0) > P(x_1 = 0 | x_{2..4} = 0), and we have shown earlier that this leads to a contradiction. Thus no change of assignment in two unconnected nodes can improve the posterior probability.

If the two are connected, say x_1 and x_2, then the same argument holds with respect to the pair of nodes (x_1, x_2). Note that the subgraph \tilde{x}_{12} is isomorphic to the subgraph x_{12} and the two subgraphs have the same neighbors. Hence:

P(x_{12} = 1 | x_{3..5} = 0) > P(x_{12} = 0 | x_{3..5} = 0)    (12)

implies that:

P(\tilde{x}_{12} = 1 | \tilde{x}_{3..5} = 0) > P(\tilde{x}_{12} = 0 | \tilde{x}_{3..5} = 0)    (13)

and this is again in contradiction to equation 9. Thus no change of assignment in two connected nodes can improve the posterior probability.

Similar arguments show that no change of assignment in any subtree of the graph can improve the posterior probability (e.g. changing the values of x_{1..4}, or changing the values of x_{12} and x_{45}).

These arguments no longer hold, however, when we change a subset of nodes that forms a loopy subgraph of G. For example, the subgraph x_1, x_2, x_3, x_5 is not isomorphic to the subgraph \tilde{x}_1, \tilde{x}_2, \tilde{x}_3, \tilde{x}_5. Indeed, since the unwrapped tree is a tree, it cannot have a loopy subgraph. Hence we cannot equate the probabilities of the two subgraphs given their neighbors.

If the subset forms a single loop, however, then there exists an arbitrarily long chain in the unwrapped tree that corresponds to the unwrapping of that loop. For example, note the chain \tilde{x}_1, \tilde{x}_2, \tilde{x}_5, \tilde{x}_3', ... in figure 3b. Using a similar argument to that used in proving the optimality of max-product in a single loop graph [23], [20], [2], [5], we can show that if we can improve the posterior in the loopy graph by changing the values of x_1, x_2, x_3, x_5, then we can also improve the posterior in the unwrapped graph by changing the values of the arbitrarily long chain. This again leads to a contradiction.

B. Proof of Claim 1:

We denote by G the original graph and by \tilde{G} the unwrapped graph. We use x_i for nodes in the original graph and \tilde{x}_i for nodes in the unwrapped graph. We define a mapping C from nodes in \tilde{G} to nodes in G. This mapping says, for each node in \tilde{G}, what the corresponding node in G is: C(\tilde{x}_1) = x_1.

The unwrapped tree is therefore a graph \tilde{G} and a correspondence map. We now give the method of constructing both; a code sketch of the construction appears after equation 14. Pick an arbitrary node in G, say x_1. Set C(\tilde{x}_1) = x_1. Iterate t times:
• Find all leaves of \tilde{G} (start with the root).
• For each leaf \tilde{x}_i find all k nodes in G that neighbor C(\tilde{x}_i).
• Add k - 1 nodes as children to \tilde{x}_i, corresponding to all neighbors of C(\tilde{x}_i) except C(\tilde{x}_j), where \tilde{x}_j is the parent of \tilde{x}_i.

The potential matrices and observations for each node in the unwrapped network are copied from the corresponding nodes in the loopy graph. That is, if x_j = C(\tilde{x}_i) then \tilde{y}_i = y_j, and:

\tilde\Psi(\tilde{x}_i, \tilde{x}_j) = \Psi(C(\tilde{x}_i), C(\tilde{x}_j))    (14)
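The construction just described can be written compactly; the following Python sketch (our own, with hypothetical names) builds the unwrapped tree and the correspondence map C, after which potentials and observations would be copied via equation 14. The example adjacency is our reading of the diamond graph of figure 3a.

```python
def unwrap(adj, root, t):
    """Build the unwrapped (computation) tree of depth t around `root`.

    adj maps each node of the loopy graph G to its list of neighbors.
    Returns the tree edges and the correspondence map C from tree nodes
    (integers, root = 0) back to nodes of G; potentials and observations
    are then copied through C as in equation 14.
    """
    C = {0: root}              # C(x~_0) = root
    parent_node = {0: None}    # the graph node the parent corresponds to
    edges, frontier, next_id = [], [0], 1
    for _ in range(t):
        new_frontier = []
        for leaf in frontier:
            for nbr in adj[C[leaf]]:
                if nbr == parent_node[leaf]:
                    continue   # k-1 children: skip the parent's graph node
                C[next_id] = nbr
                parent_node[next_id] = C[leaf]
                edges.append((leaf, next_id))
                new_frontier.append(next_id)
                next_id += 1
        frontier = new_frontier
    return edges, C

# The diamond graph of figure 3a, as we read it: x1 and x5 each join x2, x3, x4.
adj = {1: [2, 3, 4], 2: [1, 5], 3: [1, 5], 4: [1, 5], 5: [2, 3, 4]}
edges, C = unwrap(adj, root=1, t=2)
```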

Note that the unwrapped tree around x_i after t_1 iterations is a subtree of the unwrapped tree around x_j after t_2 > t_1 iterations. If we let the number of iterations t go to infinity, then the unwrapped tree \tilde{G} becomes a well-studied object in topology: the universal covering of G [16]. Roughly speaking, it is a topology that preserves the local topology of the graph G but is singly connected. It is precisely this fact, that max-product gives the global optimum on a graph that has the same local topology as G, that makes sure that x^* is a neighborhood maximum of P(x|y).

We now state some properties of \tilde{G}.

1. Equal neighbors property: Every non-leaf node in \tilde{G} has the same number of neighbors as the corresponding node in G, and the neighbors are in one-to-one correspondence. If C(\tilde{x}_i) = x_i, then for each \tilde{x}_j \in N(\tilde{x}_i), C(\tilde{x}_j) \in N(x_i), and for each x_j \in N(x_i) there exists \tilde{x}_k \in N(\tilde{x}_i) such that C(\tilde{x}_k) = x_j. This follows directly from the method of constructing \tilde{G}.

2. Equal conditional probability property: The probability of a nonleaf node \tilde{x}_j given its neighbors in \tilde{G} is equal to the probability of C(\tilde{x}_j) given its neighbors. Formally, if C(\tilde{x}_j) = x_i then:

P(\tilde{x}_j | N(\tilde{x}_j) = z, \tilde{y}) = P(x_i | N(x_i) = z, y)    (15)

This follows from the Markov blanket property of MRFs (eq. 5) and the equal neighbors property.

3. Isomorphic subtree property: For any subtree T \subseteq G and sufficiently large unwrapping count t, there exists an isomorphic subtree \tilde{T} \subseteq \tilde{G}. The nodes of the subtrees are in one-to-one correspondence: for each x_i \in T there exists an \tilde{x}_i \in \tilde{T} such that C(\tilde{x}_i) = x_i, and for each \tilde{x}_j \in \tilde{T}, C(\tilde{x}_j) \in T. To prove this we pick the root node of T, x_i, as the initial node around which to expand the unwrapped tree. By the method of construction, the unwrapped tree after a number of iterations equal to the depth of T will be isomorphic to T and in one-to-one correspondence. This gives us \tilde{T}. For any other choice of initial node for \tilde{G}, the unwrapped tree starting with x_i is a subtree of \tilde{G}.

4. Infinite chain property: For any loop L \subseteq G there exists an arbitrarily long chain \tilde{L} \subseteq \tilde{G} that is the unwrapping of L. Furthermore, if we denote by n the length of the chain divided by the length of the loop, and by \tilde{x}_1 and \tilde{x}_N the two endpoints of the chain, then:

P(\tilde{x}_{\tilde{L}} | N(\tilde{x}_{\tilde{L}})) = \alpha P^n(x_L | N(x_L))    (16)

with the "boundary potentials":

\alpha = \prod_{\tilde{x}_i \in N(\tilde{x}_1) \setminus \tilde{x}_2} \Psi(\tilde{x}_i, \tilde{x}_1) \prod_{\tilde{x}_i \in N(\tilde{x}_N) \setminus \tilde{x}_{N-1}} \Psi(\tilde{x}_i, \tilde{x}_N)    (17)

Note that \alpha is independent of n. x_L and \tilde{x}_{\tilde{L}} refer to all node variables in the sets L and \tilde{L}, respectively.

The existence of the arbitrarily long chain follows from the equal neighbors property, while equation 16 follows from the Markov blanket property for MRFs (equation 5).

Another property of the unwrapped tree that we will need was proven in [22]:

5. Periodic assignment lemma: Let m^* be a fixed point of the max-product algorithm and x^* the max-product assignment in G. Let \tilde{G} be the unwrapped tree. Suppose we modify the observation potentials \tilde\Psi_{ii} at the leaf nodes to include the messages from the nodes to be added at the next stage of unwrapping. Then the MAP assignment \tilde{x}^* in \tilde{G} is a replication of x^*: if x_i = C(\tilde{x}_j) then \tilde{x}_j^* = x_i^*.

Using these properties, we can prove the main claim. To simplify notation, we again assume that x^*, the assignment based on a fixed point of the max-product algorithm, is equal to zero, x^* = 0. The periodic assignment property guarantees that we can modify \tilde\Psi_{ii}(x_i, y_i) for the leaf nodes so that the optimal assignment in the unwrapped tree is all zeros:

P(\tilde{x} = 0 | \tilde{y}) = \max_{\tilde{x}} P(\tilde{x} | \tilde{y})    (18)

Now, assume that we can choose a subtree T \subseteq G, change the assignment of its nodes x_T to another value, and increase the posterior. Again, to simplify notation, assume that the maximizing value is x_T = 1. By the Markov property, this means that:

P(x_T = 1 | N(x_T) = 0, y) > P(x_T = 0 | N(x_T) = 0, y)    (19)

Now, by the isomorphic subtree property we know that there exists \tilde{T} \subseteq \tilde{G} that is isomorphic to T. We also know that \tilde{T} has the same conditional probability given its neighbors as T does. Thus equation 19 implies that:

P(\tilde{x}_{\tilde{T}} = 1 | N(\tilde{x}_{\tilde{T}}) = 0, \tilde{y}) > P(\tilde{x}_{\tilde{T}} = 0 | N(\tilde{x}_{\tilde{T}}) = 0, \tilde{y})    (20)

in contradiction to equation 18. Hence changing the values of a subtree from the converged values of the max-product algorithm cannot increase the posterior probability.

Now, assume we change the values of a single loop L \subseteq G and increase the posterior probability. This means that:

P(x_L = 1 | N(x_L) = 0, y) > P(x_L = 0 | N(x_L) = 0, y)    (21)

By the infinite chain property, we know that for arbitrarily large n:

P(\tilde{x}_{\tilde{L}} = 1 | N(\tilde{x}_{\tilde{L}}) = 0) = \alpha P^n(x_L = 1 | N(x_L) = 0)    (22)

Since \alpha is independent of n, the n-th power of the ratio in equation 21 eventually dominates the boundary terms, so equation 21 implies that:

P(\tilde{x}_{\tilde{L}} = 1 | N(\tilde{x}_{\tilde{L}}) = 0) > P(\tilde{x}_{\tilde{L}} = 0 | N(\tilde{x}_{\tilde{L}}) = 0)    (23)

in contradiction to the optimality of \tilde{x}^* = 0. Hence changing the values of a single loop cannot improve the posterior probability over x^*.

Now we take two disconnected components C_1, C_2 \subseteq G and assume that changing the values of x_{C_1}, x_{C_2} improves the posterior probability. Again, by the Markov property:

P(x_{C_1} = 1, x_{C_2} = 1 | N(x_{C_1}, x_{C_2}) = 0, y) > P(x_{C_1} = 0, x_{C_2} = 0 | N(x_{C_1}, x_{C_2}) = 0, y)    (24)

but since C_1 and C_2 are not connected this implies:

P(x_{C_1} = 1 | N(x_{C_1}) = 0, y) P(x_{C_2} = 1 | N(x_{C_2}) = 0, y) > P(x_{C_1} = 0 | N(x_{C_1}) = 0, y) P(x_{C_2} = 0 | N(x_{C_2}) = 0, y)    (25)

which implies that either:

P(x_{C_1} = 1 | N(x_{C_1}) = 0, y) > P(x_{C_1} = 0 | N(x_{C_1}) = 0, y)    (26)

or:

P(x_{C_2} = 1 | N(x_{C_2}) = 0, y) > P(x_{C_2} = 0 | N(x_{C_2}) = 0, y)    (27)

Thus if C_1 and C_2 are each either a tree or a single loop, this leads to a contradiction. Hence we cannot simultaneously change the values of two subtrees, or of a subtree and a loop, or of two disconnected loops, and increase the posterior probability. Similarly, we can show that changing the values of any finite number of disconnected trees or single loops will not increase the posterior. This proves claim 1.

III. Examples

Claim 1 holds for arbitrary topologies and arbitrary potentials (both discrete and continuous nodes). We illustrate the implications of claim 1 for specific networks.

A. Gaussian graphical models

A Gaussian graphical model is one in which the joint distribution over x is Gaussian. Weiss and Freeman [22] have analyzed belief propagation on such graphs. One of the results given there can also be proved using our claim 1.

Corollary 1: For a Gaussian graphical model of arbitrary topology, if belief propagation converges, then the posterior marginal means calculated using belief propagation are exact.

Proof: For Gaussians, max-product and sum-product are identical. The posterior means calculated by belief propagation are therefore identical to the max-product assignment. By claim 1, we know that this must be a neighborhood maximum of the posterior probability. But Gaussians are unimodal, hence it must be a global maximum of the posterior probability. Thus the max-product assignment must equal the MAP assignment, and the posterior means calculated using belief propagation are exact.


Fig. 4. Turbo code structure.


Fig. 5. Results of small turbo-code simulation. (a) Percent correct decodings for the max-product algorithm, compared with greedy gradient ascent in the posterior probability. (b) Comparison of convergence results of the two algorithms.

B. Turbo-codes

Figure 4 shows the pairwise Markov network corresponding to the decoding of a turbo code with 7 unknown bits. The top and bottom nodes represent the two transmitted messages (one for each constituent code). Thus in this example, the top and bottom nodes can take on 128 possible values. The potentials between the message nodes and their observations give the posterior probability of the transmitted word given one message, and the potentials between the message nodes and the bit nodes impose consistency. For example, \Psi(bit_1, message_1) = 1 if the first bit of message_1 is equal to bit_1, and zero otherwise. It is easy to show [21] that sum-product belief propagation on this graph gives the turbo decoding algorithm and max-product belief propagation gives the modified turbo decoding algorithm of [3].

Corollary 2: For a turbo code with arbitrary constituent codes, let x^* be a fixed-point max-product decoding. Then P(x^*|y) > P(x|y) for all x within Hamming distance 2 of x^*.

Proof: This follows from the main claim. Note that whenever we change any bits in the graph we also have to change the two message nodes, so changing more than two bits would give two loops.

An obvious consequence of corollary 2 is that for a turbo code with arbitrary constituent codes, the max-product decoding is either the MAP decoding or at least Hamming distance 3 from the MAP decoding. In other words, the max-product algorithm cannot converge to a decoding that is "almost" right: if it is wrong it must be wrong in at least three bits. In order for max-product to converge to a wrong decoding, there must exist a decoding that is at least distance 3 from the MAP decoding, and that decoding must have higher posterior probability than anything in its neighborhood. If no such wrong decoding exists, the max-product algorithm must either converge to the MAP decoding or fail to converge to a fixed point.

This b ehavior can be contrasted with the b ehavior of \greedy" iterative deco ding which increases the p osterior

probability of the deco ding at every iteration. Greedy iterative deco ding checks all bits and compares the p osterior

probability with the currentvalue of that bit versus ipping that bit. If the p osterior probability improved with ipping,

the algorithm ips it this is equivalent to free energy deco ding [15] at zero temp erature. This greedy deco ding algorithm

is guaranteed to converge to a local maximum of the p osterior probability.
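A minimal sketch of this greedy bit-flipping scheme (our own illustration; the `posterior` callable is an assumed interface to the decoder's posterior evaluation):

```python
def greedy_decode(posterior, x_init):
    """Greedy iterative decoding by single-bit flips.

    posterior(x) returns an (unnormalized) P(x|y) for a tuple of bits x.
    Repeatedly flips any single bit that increases the posterior; since
    each accepted flip strictly increases the posterior over a finite
    state space, the loop terminates at a local maximum.
    """
    x = list(x_init)
    improved = True
    while improved:
        improved = False
        for i in range(len(x)):
            flipped = x.copy()
            flipped[i] ^= 1
            if posterior(tuple(flipped)) > posterior(tuple(x)):
                x = flipped            # flip the bit; posterior increased
                improved = True
    return tuple(x)
```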

To illustrate these properties we ran the following simulations. We simulated transmitting turbo-encoded codewords of length 7 bits over a Gaussian channel. We compared the max-product decoding and the greedy decoding to the MAP decoding (since we were dealing with such short block lengths, we could calculate the MAP decoding using exhaustive search). We varied the noise \sigma of the Gaussian channel.

Figure 5 shows the results. Figure 5a shows the probability of convergence for both algorithms. Convergence was determined numerically for both algorithms: if the standard deviation of the messages over 10 successive iterations was less than 10^{-5} we declared convergence. If this criterion was not achieved in 100 iterations, we called that run a failure to converge.

Figure 5b shows the probability of a correct decoding (i.e. a decoding equal to the MAP decoding) for the two algorithms on the cases for which both converged. When max-product converges, it always finds the MAP decoding. In contrast, greedy decoding very frequently converges to a wrong decoding.

C. 2D grids

Figure 2a shows a two dimensional grid. For two dimensional grids, it is easy to show the following corollaries:

Corollary 3: For two dimensional grids of arbitrary size and arbitrary potentials, any configuration is either (1) in the SLT region of the max-product assignment or (2) in the SLT region of an assignment that is in the SLT region of the max-product assignment.

Corollary 4: For two dimensional grids of size n, the size of the SLT neighborhood increases exponentially with n.

Both corollaries follow from the fact that we can change the values of all the even or odd rows to an arbitrary value.

We compared greedy decoding to max-product decoding on the 3x3 grid. Max-product found the MAP decoding in 99% of the runs, and when it was wrong, its assignment was always the second best. In comparison, greedy decoding found the MAP assignment only on 44% of the runs, and the ranking was anywhere between 4 and 61.

IV. Discussion

The idea of using the computation tree to prove properties of the max-product assignment was also used in [23], [20], [8], [11]. The main tool in those analyses was the fact that the max-product assignment was the global optimum in the unwrapped tree. There are two problems with generalizing this approach to arbitrary topologies.

First, the global optimum depends on the numerosity of node replicas in the unwrapped tree. That is, different nodes in G may have different numbers of replicas in \tilde{G}. This leads to a distinction between balanced (or nonskewed) graphs [20], [8] and unbalanced (or skewed) graphs. Balanced graphs are those for which all nodes in G have asymptotically the same number of replicas in \tilde{G}. For unbalanced graphs, it is much more difficult to relate global optimality in \tilde{G} to optimality in G.

Second, the global optimum in the unwrapped tree contains contributions from the interior nodes, which have exactly the same neighbors in \tilde{G} as do their corresponding nodes in G, and contributions from the leaf nodes, which are missing some of their neighbors in \tilde{G}. Unfortunately, for most graphs G the number of leaf nodes grows at the same rate as the non-leaf nodes and cannot be neglected in the analysis.

In this analysis, on the other hand, we used primarily the local properties of the computation tree. No matter what the topology of G is, it is always the case that the local structure of \tilde{G} is the same as the local structure of G. Thus the numerosity of the nodes in \tilde{G} and the ratio of leaf nodes to non-leaf nodes are irrelevant. In this way, we can analyze the max-product assignment in arbitrary topologies.

Although we exploited the local properties, we would like to extend our analysis using the global properties as well. Our simulation results indicate that the max-product assignment is better than our analytical results guarantee. For example, in the turbo code simulations we found that the posterior probability often contained two SLT maxima, but for all these cases, max-product found the global maximum and not the second SLT maximum. In current work, we are looking into using the global properties of the computation tree to extend our analysis.

Appendix

Relationship between belief propagation schemes

A. Converting a factor graph to a pairwise Markov graph

A factor graph [7] is a bipartite graph with function nodes f_i (denoted by filled squares) and variable nodes x_i (denoted by unfilled circles). The function nodes denote a decomposition of a "global" function g(x) into a product of "local" functions f_i(x). We will assume that g represents a joint distribution over the variable nodes. The method of converting a factor graph into a pairwise Markov graph is to remove the function nodes (a code sketch follows the two steps below). Specifically:

• Find all f_i of degree 2. For each such f_i, remove it from the graph and directly connect the two variable nodes that were connected to that function node. That is, if f_1 is a degree 2 node connected to x_1, x_2, we directly connect x_1 and x_2 and set \Psi(x_1, x_2) = f_1(x_1, x_2).
• For all f_i of degree d > 2, replace the node f_i with a new variable node x_{N+i}. The variable node x_{N+i} represents the joint configuration of all the x_j that were connected to f_i. That is, if f_2 is connected to x_2, x_3, x_4, then the new variable node is a vector with three components, (x_2', x_3', x_4') (node x_5 in figure 6). Set the potentials \Psi(x_{N+i}, x_j) = \delta(x_j' - x_j). Finally, add an observation node y_{N+i} and set \Psi(x_{N+i}, y_{N+i}) = f_i(x).
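The two steps above can be sketched as follows; the data layout (`variables`, `factors`, and the `('cluster', idx)` node naming) is our own assumption for illustration.

```python
from itertools import product

def factor_to_pairwise(variables, factors):
    """Convert a factor graph to a pairwise Markov graph (a sketch).

    variables: dict mapping each variable name to its number of states.
    factors:   list of (scope, table) pairs, where scope is a tuple of
               variable names and table maps joint states to values.
    """
    pair_pot, obs_pot = {}, {}
    for idx, (scope, table) in enumerate(factors):
        if len(scope) == 2:
            # degree-2 factor: becomes an ordinary edge potential
            pair_pot[scope] = table
        else:
            # higher-degree factor: becomes a cluster variable node whose
            # states are joint configurations of the factor's scope
            cluster = ('cluster', idx)
            states = list(product(*[range(variables[v]) for v in scope]))
            # the "observation" potential Psi(x_c, y_c) carries f_i itself
            obs_pot[cluster] = {s: table[s] for s in states}
            # agreement potentials Psi(x_c, x_j) = delta(x_j' - x_j)
            for pos, v in enumerate(scope):
                pair_pot[(cluster, v)] = {
                    (s, xv): 1.0 if s[pos] == xv else 0.0
                    for s in states for xv in range(variables[v])}
    return pair_pot, obs_pot
```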

It is easy to show that (1) the joint distribution over the variables x_1 ... x_N in the pairwise Markov graph is exactly g(x), and (2) the belief propagation algorithm in the pairwise Markov graph is equivalent to the belief propagation algorithm in [7].


Fig. 6. Any factor graph can be converted into a Markov random field with pairwise potentials that represents exactly the same probability distribution over variables. When this conversion is done, the belief propagation algorithm for the pairwise Markov graph is equivalent to the belief propagation algorithm on the factor graph.

B. Converting a junction graph to a pairwise Markov graph

A junction graph [1] is a graph in which vertices s_i represent "local domains" of a global function. Thus if g(x_1, x_2, x_3) factorizes so that:

g(x_1, x_2, x_3) = f_1(x_1, x_2) f_2(x_2, x_3)    (28)

then the two local domains are s_1 = (x_1, x_2) and s_2 = (x_2, x_3). Edges between these vertices correspond to "communication links" in a message passing scheme for calculating marginals of g.

Aji et al. showed that for such a message passing algorithm to exist, the junction graph must possess the "running intersection property": the subset of nodes whose domains include x_i, together with the edges connecting these nodes, must form a connected graph. We now show that junction graphs are equivalent to pairwise Markov graphs.

To show this we leave the graph between the s_i unchanged and add "observation" nodes y_i such that \Psi(s_i, y_i) = f_i(s_i). We set \Psi(s_i, s_j) = 1 if the two nodes agree on the value of any x_i that exists in both domains, and zero otherwise. Note that the running intersection property guarantees that any two nodes (not necessarily neighboring) must agree on the value of a common x_i for the joint distribution to be nonzero. When the potentials are set in this way, it is easy to see that the joint distribution over x in the pairwise Markov graph is exactly g(x) and that the belief propagation algorithm in the Markov graph is equivalent to the GDL algorithm in [1].
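For illustration, a minimal sketch of the agreement potential \Psi(s_i, s_j) just described, with local-domain configurations encoded as dicts (an encoding of our own choosing):

```python
def junction_edge_potential(s_i, s_j):
    """Psi(s_i, s_j) for a junction-graph edge: 1 if the two local-domain
    configurations agree on every variable they share, else 0.

    s_i, s_j: dicts mapping variable names to the values they take in the
    two configurations.
    """
    shared = set(s_i) & set(s_j)
    return 1.0 if all(s_i[v] == s_j[v] for v in shared) else 0.0

# e.g. for the domains s1 = (x1, x2) and s2 = (x2, x3) of equation 28:
print(junction_edge_potential({'x1': 0, 'x2': 1}, {'x2': 1, 'x3': 0}))  # 1.0
print(junction_edge_potential({'x1': 0, 'x2': 1}, {'x2': 0, 'x3': 0}))  # 0.0
```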

References

[1] S. M. Aji and R. J. McEliece. The generalized distributive law. IEEE Transactions on Information Theory, 2000. In press.
[2] S. M. Aji, G. B. Horn, and R. J. McEliece. On the convergence of iterative decoding on graphs with a single cycle. In Proc. 1998 ISIT, 1998.
[3] S. Benedetto, G. Montorsi, D. Divsalar, and F. Pollara. Soft-output decoding algorithms in iterative decoding of Turbo codes. Technical Report 42-124, JPL TDA, 1996.
[4] D. P. Bertsekas. Dynamic Programming: Deterministic and Stochastic Models. Prentice Hall, 1987.
[5] G. D. Forney, F. R. Kschischang, and B. Marcus. Iterative decoding of tail-biting trellises. Preprint presented at the 1998 Information Theory Workshop in San Diego, 1998.
[6] W. T. Freeman and E. C. Pasztor. Learning to estimate scenes from images. In M. S. Kearns, S. A. Solla, and D. A. Cohn, editors, Adv. Neural Information Processing Systems 11. MIT Press, 1999.
[7] B. J. Frey, R. Koetter, and A. Vardy. Skewness and pseudocodewords in iterative decoding. In Proc. 1998 ISIT, 1998.
[8] B. J. Frey, R. Koetter, and A. Vardy. Signal space characterization of iterative decoding. IEEE Trans. Info. Theory, 2001. Special issue on Codes on Graphs and Iterative Algorithms.
[9] R. G. Gallager. Low Density Parity Check Codes. MIT Press, 1963.
[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. PAMI, 6(6):721-741, November 1984.
[11] G. B. Horn. Iterative Decoding and Pseudocodewords. PhD thesis, Department of Electrical Engineering, California Institute of Technology, Pasadena, CA, May 1999.
[12] F. V. Jensen. An Introduction to Bayesian Networks. Springer, 1996.
[13] F. R. Kschischang, B. J. Frey, and H.-A. Loeliger. Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 1998. Submitted.
[14] S. Lauritzen. Graphical Models. Oxford University Press, 1996.
[15] D. J. C. MacKay and R. M. Neal. Good error-correcting codes based on very sparse matrices. IEEE Transactions on Information Theory, 1999.
[16] J. R. Munkres. Topology: A First Course. Prentice Hall, 1975.
[17] K. P. Murphy, Y. Weiss, and M. I. Jordan. Loopy belief propagation for approximate inference: an empirical study. In Proceedings of Uncertainty in AI, 1999.
[18] J. Pearl. Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann, 1988.
[19] S. E. Shimony. Finding MAPs for belief networks is NP-hard. Artificial Intelligence, 68(2):399-410, 1994.
[20] Y. Weiss. Belief propagation and revision in networks with loops. Technical Report 1616, MIT AI Lab, 1997.
[21] Y. Weiss. Correctness of local probability propagation in graphical models with loops. Neural Computation, 2000. To appear.
[22] Y. Weiss and W. Freeman. Correctness of belief propagation in Gaussian graphical models of arbitrary topology. In S. Solla, T. K. Leen, and K. R. Muller, editors, Advances in Neural Information Processing Systems 12, 2000.
[23] N. Wiberg. Codes and Decoding on General Graphs. PhD thesis, Department of Electrical Engineering, U. Linkoping, Sweden, 1996.

Biographies of the Authors

A. Yair Weiss

Yair Weiss is a postdoctoral scientist at the UC Berkeley Department of Electrical Engineering and Computer Science. He received his PhD from MIT in 1998, where he worked with Ted Adelson on Bayesian models for vision. His research interests include human and machine vision, probabilistic inference and learning, and neural computation.

B. Wil liam T. Freeman

William T. Freeman is a Senior Research Scientist at MERL, Mitsubishi Electric Research Labs. He received his PhD in computer vision in 1992 from the Massachusetts Institute of Technology, a BS in physics and an MS in electrical engineering from Stanford in 1979, and an MS in applied physics from Cornell in 1981.

His current research interests include machine learning applied to computer vision, Bayesian models of visual perception, and interactive applications of computer vision. In 1997, he received the Outstanding Paper prize at the Conference on Computer Vision and Pattern Recognition for work on applying bilinear models to "separating style and content".

From 1981 to 1987, Dr. Freeman worked in electronic imaging at the Polaroid Corporation, developing algorithms for color image reconstruction used in Polaroid's digital camera. In 1987-88, he was a Foreign Expert at the Taiyuan University of Technology, P. R. China. Dr. Freeman is an Associate Editor of IEEE-PAMI.