Probabilistic Graphical Models

Brown University CSCI 2950-P, Spring 2013 Prof. Erik Sudderth Lecture 5: Belief Propagation & Factor Graphs

Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models

Inference in Graphical Models

x_E  observed evidence variables (subset of nodes)
x_F  unobserved query nodes we'd like to infer
x_R  remaining variables, extraneous to this query but part of the given graphical representation

p(x_E, x_F) = Σ_{x_R} p(x_E, x_F, x_R),   R = V \ {E, F}

Posterior Marginal Densities

p(x_F | x_E) = p(x_E, x_F) / p(x_E),   p(x_E) = Σ_{x_F} p(x_E, x_F)

• Provides Bayesian estimators, confidence measures, and sufficient statistics for iterative parameter estimation
• The elimination algorithm assumed a single query node. What if we want the marginals for all unobserved nodes?
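Before any clever algorithms, these definitions can be checked by brute-force summation over a joint table. A minimal numpy sketch; the three binary variables and all probability values are made up for illustration:

```python
import numpy as np

# Toy joint p(x_E, x_F, x_R) over three binary variables,
# stored as a table indexed [x_E, x_F, x_R] (values are made up).
joint = np.array([[[0.10, 0.05], [0.15, 0.10]],
                  [[0.20, 0.05], [0.25, 0.10]]])
assert np.isclose(joint.sum(), 1.0)

p_EF = joint.sum(axis=2)           # p(x_E, x_F) = sum_{x_R} p(x_E, x_F, x_R)
p_E = p_EF.sum(axis=1)             # p(x_E)      = sum_{x_F} p(x_E, x_F)
p_F_given_E = p_EF / p_E[:, None]  # posterior marginal p(x_F | x_E)

print(p_F_given_E[1])              # distribution over x_F given evidence x_E = 1
```

Direct summation like this costs time exponential in the number of summed-out variables, which is exactly what elimination, and belief propagation below, avoid on sparse graphs.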

Inference in Undirected Trees

• For a tree, the maximal cliques are always pairs of nodes:

p(x) = (1/Z) ∏_{(s,t)∈E} ψ_st(x_s, x_t) ∏_{s∈V} ψ_s(x_s)

Inference via the Distributive Law

[Figure: a four-node tree with hub X_2 and leaves X_1, X_3, X_4. To compute the marginal at X_1, messages m_32(x_2) and m_42(x_2) flow from the leaves into X_2, which then passes a message on to the query node X_1.]

Computing Multiple Marginals

[Figure: panels (a)-(d) show the same four-node tree, with the messages m_12, m_32, m_42, m_23, m_24 oriented toward each possible query node in turn.]

• Can compute all marginals, at all nodes, by combining incoming messages from adjacent edges
• Each message must only be computed once, via some message update schedule

Belief Propagation (Sum-Product)

BELIEFS: Posterior marginals

q_t(x_t) ∝ ψ_t(x_t) ∏_{s∈Γ(t)} m_st(x_t)

Γ(t)  neighborhood of node t (adjacent nodes)

MESSAGES: Sufficient statistics

m_ts(x_s) ∝ Σ_{x_t} ψ_st(x_s, x_t) ψ_t(x_t) ∏_{u∈Γ(t)\s} m_ut(x_t)

I) Message Product   II) Message Propagation

Message Update Schedules

• Message Passing Protocol: A node can send a message to a neighboring node when, and only when, it has received incoming messages from all of its other neighbors

• Synchronous Parallel Schedule: At each iteration, every node computes all outputs for which it has the needed inputs
• Global Sequential Schedule: Choose some node as the root of the tree. Pass messages from the leaves to the root, and then from the root back to the leaves.

• Asynchronous Parallel Schedule: Initialize messages arbitrarily. At each iteration, all nodes compute all outputs from all current inputs. Iterate until convergence.

Belief Propagation for Trees

• Dynamic programming algorithm which exactly computes all marginals
• On Markov chains, BP is equivalent to the alpha-beta or forward-backward algorithms for HMMs
• Sequential message schedules require each message to be updated only once
• Computational cost, for N nodes with K discrete states each: Belief Prop O(N K²), Brute Force O(K^N)
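To make the cost claim concrete, here is a minimal numpy sketch (our own illustration, not the course's reference code) of sum-product on the four-node tree from the figures, run with the global sequential schedule and checked against O(K^N) brute-force enumeration; all potentials are random:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
K = 3                                  # discrete states per node
nodes = [0, 1, 2, 3]                   # stand-ins for X1..X4; node 1 is the hub
edges = [(0, 1), (1, 2), (1, 3)]
nbrs = {0: [1], 1: [0, 2, 3], 2: [1], 3: [1]}
psi = {s: rng.uniform(0.5, 1.5, K) for s in nodes}          # node potentials
psi_e = {e: rng.uniform(0.5, 1.5, (K, K)) for e in edges}   # edge potentials

def pair(s, t):
    """Edge potential as a matrix indexed [x_s, x_t]."""
    return psi_e[(s, t)] if (s, t) in psi_e else psi_e[(t, s)].T

def send(t, s, msgs):
    """m_ts(x_s): I) product of psi_t and messages from all neighbors but s,
    II) propagation through the edge potential (the sum over x_t)."""
    prod = psi[t].copy()
    for u in nbrs[t]:
        if u != s:
            prod *= msgs[(u, t)]
    m = pair(s, t) @ prod
    return m / m.sum()                 # normalize for numerical stability

# Global sequential schedule: leaves -> root (node 1), then root -> leaves.
msgs = {}
for t, s in [(0, 1), (2, 1), (3, 1), (1, 0), (1, 2), (1, 3)]:
    msgs[(t, s)] = send(t, s, msgs)

def brute_marginal(s):
    """O(K^N) enumeration of the unnormalized joint, for comparison."""
    marg = np.zeros(K)
    for x in product(range(K), repeat=len(nodes)):
        p = np.prod([psi[v][x[v]] for v in nodes])
        p *= np.prod([pair(a, b)[x[a], x[b]] for a, b in edges])
        marg[x[s]] += p
    return marg / marg.sum()

for s in nodes:                        # belief: psi_s times incoming messages
    belief = psi[s].copy()
    for u in nbrs[s]:
        belief *= msgs[(u, s)]
    belief /= belief.sum()
    assert np.allclose(belief, brute_marginal(s))
```

Each message here is computed exactly once, matching the sequential-schedule bullet above.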

BP for Continuous Variables

[Figure 2.14: For a tree-structured graph, each node i partitions the graph into |Γ(i)| disjoint subtrees. Conditioned on x_i, the variables x_{j\i} in these subtrees are independent.]

p(x_1) ∝ ∫∫∫ ψ_1(x_1) ψ_12(x_1, x_2) ψ_2(x_2) ψ_23(x_2, x_3) ψ_3(x_3) ψ_24(x_2, x_4) ψ_4(x_4) dx_4 dx_3 dx_2
       ∝ ψ_1(x_1) ∫ ψ_12(x_1, x_2) ψ_2(x_2) [∫ ψ_23(x_2, x_3) ψ_3(x_3) dx_3] [∫ ψ_24(x_2, x_4) ψ_4(x_4) dx_4] dx_2
       = ψ_1(x_1) ∫ ψ_12(x_1, x_2) ψ_2(x_2) m_32(x_2) m_42(x_2) dx_2
       ∝ ψ_1(x_1) m_21(x_1),   where m_21(x_1) ∝ ∫ ψ_12(x_1, x_2) ψ_2(x_2) m_32(x_2) m_42(x_2) dx_2

[Figure 2.15: Example derivation of the BP message passing recursion through repeated application of the distributive law. Because the joint distribution p(x) factorizes as a product of pairwise clique potentials, the joint integral can be decomposed via messages m_ji(x_i) sent between neighboring nodes.]

BP is Exact for Trees
Proof on Board

Such multi-dimensional integrals (or summations) can be efficiently decomposed into a set of simpler, local computations. In particular, for tree-structured graphical models a generalization of dynamic programming known as belief propagation (BP) [178, 231, 255] recursively computes exact posterior marginals in linear time. In the following sections, we provide a brief derivation of BP, and discuss issues arising in its implementation. We then present a variational interpretation of BP which justifies extensions to graphs with cycles.

Message Passing in Trees

Consider a pairwise MRF, parameterized as in Sec. 2.3.1, whose underlying graph G = (V, E) is tree-structured. As shown in Fig. 2.14, any node i ∈ V divides such a tree into |Γ(i)| disjoint subsets:

j\i ≜ {j} ∪ {k ∈ V | no path from k to j intersects i}   (2.110)

By the Markov properties of G, the variables x_{j\i} in these sub-trees are conditionally independent given x_i. The BP algorithm exploits this structure to recursively decompose the computation of p(x_i | y) into a series of simpler, local calculations.

To derive the BP algorithm, we begin by considering the clique potentials corresponding to particular subsets A ⊂ V of the full graph:

Ψ_A(x_A) ≜ ∏_{(i,j)∈E(A)} ψ_ij(x_i, x_j) ∏_{i∈A} ψ_i(x_i, y)   (2.111)

Here, E(A) ≜ {(i, j) ∈ E | i, j ∈ A} are the edges contained in the node-induced subgraph [50] corresponding to A. Using the partitions illustrated in Fig. 2.14, we can then write the marginal distribution of any node as follows:

p(x_i | y) ∝ ∫ ψ_i(x_i, y) ∏_{j∈Γ(i)} ψ_ij(x_i, x_j) Ψ_{j\i}(x_{j\i}) dx_{V\i}   (2.112)
           ∝ ψ_i(x_i, y) ∏_{j∈Γ(i)} ∫ ψ_ij(x_i, x_j) Ψ_{j\i}(x_{j\i}) dx_{j\i}   (2.113)

From the Hammersley-Clifford Theorem, Markov properties are expressed through the algebraic structure of the pairwise MRF's factorization into clique potentials. As illustrated in Fig. 2.15, tree-structured graphs allow multi-dimensional integrals (or summations) to be decomposed into a series of simpler, one-dimensional integrals. As in dynamic programming [24, 90, 303], the overall integral can then be computed via a recursion involving messages sent between neighboring nodes. This decomposition is an instance of the same distributive law underlying a variety of other algorithms [4, 50, 255], including the fast Fourier transform. Critically, because messages are shared among similar decompositions associated with different nodes, BP efficiently and simultaneously computes the desired marginals for all nodes in the graph.

BP Algorithm

[Slides: the sum-product message and belief updates, as stated above.]
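For continuous variables the message integrals must be tractable; Gaussian potentials are the standard case where every update has closed form. A minimal sketch (our own, not from the course materials) of scalar Gaussian BP in information form on a three-node chain; the J and h values are made up, J just needs to be positive definite:

```python
import numpy as np

# Gaussian MRF in information form: p(x) proportional to exp(-0.5 x'Jx + h'x),
# with a tree-structured (here tridiagonal, chain) precision matrix J.
J = np.array([[ 2.0, -0.8,  0.0],
              [-0.8,  2.5, -0.6],
              [ 0.0, -0.6,  1.8]])
h = np.array([1.0, -0.5, 0.3])
nbrs = {0: [1], 1: [0, 2], 2: [1]}

def send(t, s, msgs):
    """Message m_ts(x_s) = exp(-0.5*Jm*x_s**2 + hm*x_s), up to a constant;
    the marginalization over x_t is a 1-D Gaussian integral in closed form."""
    Jt = J[t, t] + sum(msgs[(u, t)][0] for u in nbrs[t] if u != s)
    ht = h[t] + sum(msgs[(u, t)][1] for u in nbrs[t] if u != s)
    return (-J[s, t] ** 2 / Jt, -J[s, t] * ht / Jt)   # (Jm, hm)

msgs = {}
for t, s in [(0, 1), (2, 1), (1, 0), (1, 2)]:         # leaves -> root -> leaves
    msgs[(t, s)] = send(t, s, msgs)

Sigma = np.linalg.inv(J)                  # exact answer, for comparison
for s in range(3):
    Js = J[s, s] + sum(msgs[(u, s)][0] for u in nbrs[s])
    hs = h[s] + sum(msgs[(u, s)][1] for u in nbrs[s])
    assert np.isclose(hs / Js, (Sigma @ h)[s])        # marginal mean
    assert np.isclose(1.0 / Js, Sigma[s, s])          # marginal variance
```

For non-Gaussian continuous potentials the integrals generally have no closed form, which is what motivates sample-based and nonparametric BP variants.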

Inference for Graphs with Cycles

• For graphs with cycles, the dynamic programming BP derivation breaks

Junction Tree Algorithm

• Cluster nodes to break cycles
• Run BP on the tree of clusters
• Exact, but often intractable

Loopy Belief Propagation

• Iterate local BP message updates on the graph with cycles
• Hope beliefs converge
• Empirically, often very effective…

A Brief History of Loopy BP

• 1993: Turbo codes (and later LDPC codes, rediscovered from Gallager's 1963 thesis) revolutionize error correcting codes (Berrou et al.)
• 1995-1997: Realization that the turbo decoding algorithm is equivalent to loopy BP (MacKay & Neal)
• 1997-1999: Promising results in other domains, & theoretical analysis via computation trees (Weiss)
• 2000: Connection between loopy BP & variational approximations, using ideas from statistical physics (Yedidia, Freeman, & Weiss)
• 2001-2007: Many results interpreting, justifying, and extending loopy BP
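A minimal sketch of the loopy BP recipe (our own illustration): synchronous parallel updates on a single four-node cycle, iterated until the messages stop changing; the potentials are random, and the resulting beliefs are only approximations of the exact marginals:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(1)
K, n = 2, 4
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]              # one cycle
nbrs = {0: [1, 3], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
psi = {s: rng.uniform(0.5, 1.5, K) for s in range(n)}
psi_e = {e: rng.uniform(0.5, 1.5, (K, K)) for e in edges}

def pair(s, t):
    return psi_e[(s, t)] if (s, t) in psi_e else psi_e[(t, s)].T

# Synchronous parallel schedule: initialize arbitrarily (uniform), then
# recompute every message from the previous iteration's messages.
msgs = {(t, s): np.ones(K) / K for t in range(n) for s in nbrs[t]}
for sweep in range(200):
    new = {}
    for t, s in msgs:
        prod = psi[t] * np.prod([msgs[(u, t)] for u in nbrs[t] if u != s], axis=0)
        m = pair(s, t) @ prod
        new[(t, s)] = m / m.sum()
    delta = max(np.abs(new[k] - msgs[k]).max() for k in msgs)
    msgs = new
    if delta < 1e-10:                                 # hope it converges...
        break

for s in range(n):
    belief = psi[s] * np.prod([msgs[(u, s)] for u in nbrs[s]], axis=0)
    belief /= belief.sum()
    exact = np.zeros(K)                               # brute force, for comparison
    for x in product(range(K), repeat=n):
        p = np.prod([psi[v][x[v]] for v in range(n)])
        p *= np.prod([pair(a, b)[x[a], x[b]] for a, b in edges])
        exact[x[s]] += p
    print(s, belief, exact / exact.sum())             # close, but not equal
```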

Pairwise Markov Random Fields

p(x) = (1/Z) ∏_{(s,t)∈E} ψ_st(x_s, x_t) ∏_{s∈V} ψ_s(x_s)

• Simple parameterization, but still expressive and widely used in practice
• Guaranteed Markov with respect to graph


V  set of N nodes or vertices, {1, 2, ..., N}
E  set of undirected edges (s, t) linking pairs of nodes
Z  normalization constant (partition function)

[Figure 2.4: Three graphical representations of a distribution over five random variables (see [175]). (a) Directed graph G depicting a causal, generative process. (b) Factor graph expressing the factorization underlying G. (c) A "moralized" undirected graph capturing the Markov structure of G.]

For example, in the factor graph of Fig. 2.5(c), there are 5 variable nodes, and the joint distribution has one potential for each of the 3 hyperedges:

p(x) ∝ ψ_123(x_1, x_2, x_3) ψ_234(x_2, x_3, x_4) ψ_35(x_3, x_5)

Often, these potentials can be interpreted as local dependencies or constraints. Note, however, that ψ_f(x_f) does not typically correspond to the marginal distribution p_f(x_f), due to interactions with the graph's other potentials. In many applications, factor graphs are used to impose structure on an exponential family of densities. In particular, suppose that each potential function is described by the following unnormalized exponential form:

ψ_f(x_f | θ_f) = ν_f(x_f) exp{ Σ_{a∈A_f} θ_fa φ_fa(x_f) }   (2.67)

Here, θ_f ≜ {θ_fa | a ∈ A_f} are the canonical parameters of the local exponential family for hyperedge f. From eq. (2.66), the joint distribution can then be written as

p(x | θ) = [ ∏_{f∈F} ν_f(x_f) ] exp{ Σ_{f∈F} Σ_{a∈A_f} θ_fa φ_fa(x_f) − Φ(θ) }   (2.68)

Comparing to eq. (2.1), we see that factor graphs define regular exponential families [104, 311], with parameters θ = {θ_f | f ∈ F}, whenever local potentials are chosen from such families. The results of Sec. 2.1 then show that local statistics, computed over the support of each hyperedge, are sufficient for learning from training data.
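To make eqs. (2.67) and (2.68) concrete, here is a small sketch using the five-variable, three-hyperedge example above, with binary variables, ν_f ≡ 1, and a single made-up "all states equal" feature per hyperedge (our illustration, not from the text):

```python
import numpy as np
from itertools import product

# One canonical parameter per hyperedge; values are made up.
theta = {(0, 1, 2): 0.7, (1, 2, 3): -0.4, (2, 4): 1.2}

def phi(x_f):
    """Sufficient statistic phi_f: indicator that all of x_f agree."""
    return 1.0 if len(set(x_f)) == 1 else 0.0

def log_score(x):
    """The exponent of eq. (2.68): sum_f theta_f * phi_f(x_f)."""
    return sum(th * phi(tuple(x[i] for i in f)) for f, th in theta.items())

# Log partition function Phi(theta), by brute force over all 2^5 states.
Phi = np.logaddexp.reduce([log_score(x) for x in product(range(2), repeat=5)])

x = (0, 0, 0, 1, 0)
print(np.exp(log_score(x) - Phi))    # p(x | theta), eq. (2.68)
```

As eq. (2.68) promises, this is a regular exponential family whose sufficient statistics are the per-hyperedge features φ_f.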

Undirected Graphical Models

p(x) = (1/Z) ∏_{c∈C} ψ_c(x_c)

• Parameterization exactly captures those non-degenerate distributions which are Markov with respect to this graph
• Sometimes restricted to maximal cliques, but this is not necessary

C  set of cliques (fully connected subsets) of nodes
V  set of N nodes or vertices, {1, 2, ..., N}
Z  normalization constant (partition function)

Factor Graphs

p(x) = (1/Z) ∏_{f∈F} ψ_f(x_f)

• In a hypergraph, the hyperedges link arbitrary subsets of nodes (not just pairs)
• Visualize by a bipartite graph, with square (usually black) nodes for hyperedges
• A factor graph associates a non-negative potential function with each hyperedge
• Motivation: factorization key to computation

F  set of hyperedges f ⊆ V linking subsets of nodes
V  set of N nodes or vertices, {1, 2, ..., N}
Z  normalization constant (partition function)

Factor Graphs & Factorization

p(x) = (1/Z) ∏_{f∈F} ψ_f(x_f)

• For a given undirected graph, there exist distributions which have equivalent Markov properties, but different factorizations and different inference/learning complexities:

[Figure: one undirected graph, shown with pairwise (edge) potentials, with potentials on maximal cliques, and with an alternative factorization. Examples: pairwise nearest-neighbor MRF; low density parity check (LDPC) code.]
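A small sketch of that point (our example: a fully connected three-node graph): pairwise edge potentials and a single maximal-clique potential are Markov with respect to the same graph, but differ in parameter count, and the pairwise family can represent strictly fewer distributions:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(2)
K = 4   # states per node

# Two factorizations over the same triangle graph (tables are random):
pairwise = {f: rng.uniform(0.5, 1.5, (K, K)) for f in [(0, 1), (1, 2), (0, 2)]}
maximal = {(0, 1, 2): rng.uniform(0.5, 1.5, (K, K, K))}   # 3*K^2 vs K^3 numbers

def unnormalized(factors, x):
    """p(x) up to Z: prod_f psi_f(x_f); each hyperedge f indexes some nodes."""
    return np.prod([tbl[tuple(x[i] for i in f)] for f, tbl in factors.items()])

def Z(factors):
    """Partition function by brute-force enumeration."""
    return sum(unnormalized(factors, x) for x in product(range(K), repeat=3))

print(Z(pairwise), Z(maximal))   # both define valid joint distributions
```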

Directed Graphs as Factor Graphs

Directed Graphical Model:

p(x) = ∏_{i=1}^N p(x_i | x_Γ(i)),   where Γ(i) are the parents of node i

Corresponding Factor Graph:

p(x) = ∏_{i=1}^N ψ_i(x_i, x_Γ(i))

• Associate one factor with each node, linking it to its parents and defined to equal the corresponding conditional distribution
• Information lost: Directionality of conditional distributions, and the fact that the global partition function Z = 1
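A minimal sketch of this construction (our own, with made-up CPTs): a three-node directed model p(x_1) p(x_2 | x_1) p(x_3 | x_1) becomes a factor graph with one factor per node, and because every factor equals a normalized conditional, the partition function comes out to 1:

```python
import numpy as np
from itertools import product

p1 = np.array([0.6, 0.4])                   # p(x1)
p2 = np.array([[0.9, 0.1], [0.3, 0.7]])    # p(x2 | x1), rows indexed by x1
p3 = np.array([[0.5, 0.5], [0.2, 0.8]])    # p(x3 | x1)

# psi_i(x_i, x_parents), each defined to equal a conditional distribution.
# The factors no longer record edge directions, or that they are normalized.
factors = [lambda x: p1[x[0]],
           lambda x: p2[x[0], x[1]],
           lambda x: p3[x[0], x[2]]]

Z = sum(np.prod([f(x) for f in factors]) for x in product(range(2), repeat=3))
print(Z)   # -> 1.0, up to floating point
```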