Sum-Product Inference for Factor Graphs, Learning Directed Graphical Models
Probabilistic Graphical Models
Brown University CSCI 2950-P, Spring 2013
Prof. Erik Sudderth
Lecture 6: Sum-Product Inference for Factor Graphs, Learning Directed Graphical Models
Some figures courtesy Michael Jordan's draft textbook, An Introduction to Probabilistic Graphical Models.

Pairwise Markov Random Fields

p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t) \prod_{s \in \mathcal{V}} \psi_s(x_s)

\mathcal{V}: set of N nodes or vertices, \{1, 2, \ldots, N\}
\mathcal{E}: set of undirected edges (s,t) linking pairs of nodes
Z: normalization constant (partition function)

• Simple parameterization, but still expressive and widely used in practice
• Guaranteed Markov with respect to the graph

[Figure 2.4: Three graphical representations of a distribution over five random variables (see [175]). (a) Directed graph G depicting a causal, generative process. (b) Factor graph expressing the factorization underlying G. (c) A "moralized" undirected graph capturing the Markov structure of G.]

Excerpt from the accompanying notes (Chapter 2, Nonparametric and Graphical Models): for example, in the factor graph of Fig. 2.5(c), there are 5 variable nodes, and the joint distribution has one potential for each of the 3 hyperedges:

p(x) \propto \psi_{123}(x_1, x_2, x_3) \, \psi_{234}(x_2, x_3, x_4) \, \psi_{35}(x_3, x_5)

Often, these potentials can be interpreted as local dependencies or constraints. Note, however, that \psi_f(x_f) does not typically correspond to the marginal distribution p_f(x_f), due to interactions with the graph's other potentials.

In many applications, factor graphs are used to impose structure on an exponential family of densities. In particular, suppose that each potential function is described by the following unnormalized exponential form:

\psi_f(x_f \mid \theta_f) = \nu_f(x_f) \exp\Big\{ \sum_{a \in \mathcal{A}_f} \theta_{fa} \phi_{fa}(x_f) \Big\}    (2.67)

Here, \theta_f \triangleq \{\theta_{fa} \mid a \in \mathcal{A}_f\} are the canonical parameters of the local exponential family for hyperedge f. From eq. (2.66), the joint distribution can then be written as

p(x \mid \theta) = \Big( \prod_{f \in \mathcal{F}} \nu_f(x_f) \Big) \exp\Big\{ \sum_{f \in \mathcal{F}} \sum_{a \in \mathcal{A}_f} \theta_{fa} \phi_{fa}(x_f) - \Phi(\theta) \Big\}    (2.68)

Comparing to eq. (2.1), we see that factor graphs define regular exponential families [104, 311], with parameters \theta = \{\theta_f \mid f \in \mathcal{F}\}, whenever local potentials are chosen from such families. The results of Sec. 2.1 then show that local statistics, computed over the support of each hyperedge, are sufficient for learning from training data.
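To make the pairwise MRF factorization above concrete, here is a minimal sketch (not from the lecture) that evaluates the unnormalized product of potentials and the partition function Z by brute-force enumeration. The 3-node chain, the two states per node, and the particular potential tables are illustrative assumptions.

```python
# Brute-force evaluation of a pairwise MRF: p(x) = (1/Z) * prod psi_st * prod psi_s.
# The graph, state count, and potential values below are illustrative only.
import itertools
import numpy as np

K = 2                                   # states per node (assumed)
V = [0, 1, 2]                           # nodes
E = [(0, 1), (1, 2)]                    # undirected edges (a 3-node chain)

node_pot = {s: np.array([1.0, 2.0]) for s in V}        # psi_s(x_s)
edge_pot = {e: np.array([[3.0, 1.0],                    # psi_st(x_s, x_t)
                         [1.0, 3.0]]) for e in E}

def unnormalized(x):
    """Product of all node and edge potentials at configuration x."""
    value = 1.0
    for s in V:
        value *= node_pot[s][x[s]]
    for (s, t) in E:
        value *= edge_pot[(s, t)][x[s], x[t]]
    return value

# The partition function Z sums the unnormalized measure over all K^N configurations.
configs = list(itertools.product(range(K), repeat=len(V)))
Z = sum(unnormalized(x) for x in configs)
p = {x: unnormalized(x) / Z for x in configs}
print(Z, sum(p.values()))    # the normalized probabilities sum to 1
```

Brute-force enumeration is exponential in the number of nodes, which is what motivates the sum-product recursions on the following slides.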
Belief Propagation (Sum-Product)

BELIEFS (posterior marginals):

\hat{p}_t(x_t) \propto \psi_t(x_t) \prod_{u \in \Gamma(t)} m_{ut}(x_t)

where \Gamma(t) is the neighborhood of node t (adjacent nodes).

MESSAGES (sufficient statistics):

m_{ts}(x_s) \propto \sum_{x_t} \psi_{st}(x_s, x_t) \, \psi_t(x_t) \prod_{u \in \Gamma(t) \setminus s} m_{ut}(x_t)

I) Message Product   II) Message Propagation

Belief Propagation for Trees
• Dynamic programming algorithm which exactly computes all marginals
• On Markov chains, BP is equivalent to the alpha-beta or forward-backward algorithms for HMMs
• Sequential message schedules require each message to be updated only once
• Computational cost, for N nodes with K discrete states each: belief propagation is O(N K^2), versus O(K^N) for brute-force enumeration

Factor Graphs

p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f)

\mathcal{F}: set of hyperedges linking subsets of nodes f \subseteq \mathcal{V}
\mathcal{V}: set of N nodes or vertices, \{1, 2, \ldots, N\}
Z: normalization constant (partition function)

• In a hypergraph, the hyperedges link arbitrary subsets of nodes (not just pairs)
• Visualize by a bipartite graph, with square (usually black) nodes for hyperedges
• A factor graph associates a non-negative potential function with each hyperedge
• Motivation: factorization is key to computation

Factor Graphs & Factorization

p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f)

• For a given undirected graph, there exist distributions which have equivalent Markov properties, but different factorizations and different inference/learning complexities
[Figure: the same undirected graphical model expressed with pairwise (edge) potentials, with potentials on maximal cliques, and with an alternative factorization.]

Directed Graphs as Factor Graphs

Directed graphical model:

p(x) = \prod_{i=1}^{N} p(x_i \mid x_{\Gamma(i)})

Corresponding factor graph:

p(x) = \prod_{i=1}^{N} \psi_i(x_i, x_{\Gamma(i)})

• Associate one factor with each node, linking it to its parents and defined to equal the corresponding conditional distribution
• Information lost: the directionality of the conditional distributions, and the fact that the global partition function is Z = 1

Sum-Product Algorithm: Belief Propagation for Factor Graphs
[Figure: (a) factor-to-variable messages \mu_{si}(x_i) arriving at a variable node X_i; (b) variable-to-factor messages \nu_{is}(x_i) arriving at a factor node f_s.]
• From each variable node, the incoming and outgoing messages are functions only of that particular variable
• Factor message updates must sum over all combinations of the adjacent variable nodes (exponential in degree)

Comparing Sum-Product Variants
[Figure: (a) factor-graph messages \mu_{si}(x_i) on a graph with pairwise factors; (b) the equivalent pairwise messages m_{ji}(x_i).]
• For pairwise potentials, there is one "incoming" message for each outgoing factor message, and the algorithm simplifies to the earlier pairwise update:

m_{ts}(x_s) \propto \sum_{x_t} \psi_{st}(x_s, x_t) \, \psi_t(x_t) \prod_{u \in \Gamma(t) \setminus s} m_{ut}(x_t)

Factor Graph Message Schedules
• All of the previously discussed message schedules are valid
• Here is an example of a synchronous parallel schedule:
[Figure: panels (a)-(h) of a synchronous parallel schedule on the chain x_1 - f_d - x_2 - f_e - x_3 with unary factors f_a, f_b, f_c; variable-to-factor messages \nu (e.g. \nu_{1d}(x_1), \nu_{3e}(x_3), \nu_{2d}(x_2), \nu_{2e}(x_2)) and factor-to-variable messages \mu (e.g. \mu_{a1}(x_1), \mu_{d2}(x_2), \mu_{d1}(x_1)) are updated in alternating rounds.]
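As a rough illustration of the pairwise sum-product updates above (an illustrative sketch, not the course's reference implementation), the following runs repeated message sweeps on the same 3-node chain as the earlier brute-force sketch; on a tree-structured graph the resulting beliefs match the exact marginals.

```python
# Pairwise sum-product (belief propagation) on a small tree; graph and potentials
# are the same illustrative assumptions as in the brute-force sketch above.
import numpy as np

K = 2
V = [0, 1, 2]
E = [(0, 1), (1, 2)]
node_pot = {s: np.array([1.0, 2.0]) for s in V}
edge_pot = {e: np.array([[3.0, 1.0], [1.0, 3.0]]) for e in E}

def psi(s, t):
    """psi_st indexed as [x_s, x_t], whichever way the edge is stored."""
    return edge_pot[(s, t)] if (s, t) in edge_pot else edge_pot[(t, s)].T

nbrs = {s: [t for (a, b) in E for t in (a, b) if s in (a, b) and t != s] for s in V}
msgs = {(t, s): np.ones(K) for (a, b) in E for (t, s) in ((a, b), (b, a))}

# m_ts(x_s) ∝ sum_{x_t} psi_st(x_s, x_t) psi_t(x_t) prod_{u in Γ(t)\s} m_ut(x_t)
for _ in range(len(V)):                  # enough sweeps to converge on a tree
    for (t, s) in list(msgs):
        incoming = np.prod([msgs[(u, t)] for u in nbrs[t] if u != s], axis=0)
        m = psi(s, t) @ (node_pot[t] * incoming)
        msgs[(t, s)] = m / m.sum()       # normalize for numerical stability

# Beliefs: p_hat_t(x_t) ∝ psi_t(x_t) prod_{u in Γ(t)} m_ut(x_t)
for t in V:
    b = node_pot[t] * np.prod([msgs[(u, t)] for u in nbrs[t]], axis=0)
    print(t, b / b.sum())                # exact marginals on this tree
```

The repeated-sweep loop is just one valid schedule; a single leaves-to-root and root-to-leaves pass would update each message only once, as noted on the trees slide.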
Sum-Product for "Nearly" Trees
[Figure: (a) an undirected pairwise graphical model over X_1, ..., X_6; (b) the corresponding factor graph; (c) a graphical model via an auxiliary variable Z.]
• The sum-product algorithm computes exact marginal distributions for any factor graph which is tree-structured (no cycles)
• This includes some undirected graphs with cycles

Sum-Product for Polytrees
[Figure: (a) undirected tree, (b) directed tree, (c) directed polytree, and the corresponding factor graph.]
• Early work on belief propagation (Pearl, 1980s) focused on directed graphical models, and was complicated by the directionality of edges and multiple parents (polytrees)
• The factor graph framework makes this a simple special case

Learning Directed Graphical Models

p(x) = \prod_{i \in \mathcal{V}} p(x_i \mid x_{\Gamma(i)}, \theta_i)

[Figure: a four-node directed graphical model; each node must be predicted from its parents.]
Intuition: we must learn a good predictive model of each node, given its parent nodes.
• The directed factorization causes the likelihood to locally decompose
• Often paired with a correspondingly factorized prior, e.g. for a four-node model:

p(\theta) = p(\theta_1) \, p(\theta_2) \, p(\theta_3) \, p(\theta_4)
\log p(\theta) = \log p(\theta_1) + \log p(\theta_2) + \log p(\theta_3) + \log p(\theta_4)

Complete Observations
[Figure: (a) a directed graphical model over a single training example; (b) N independent, identically distributed training examples; plate notation.]
• A directed graphical model encodes assumed statistical dependencies among the different parts of a single training example:

p(\mathcal{D} \mid \theta) = \prod_{n=1}^{N} \prod_{i \in \mathcal{V}} p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i), \qquad \mathcal{D} = \{ x_{\mathcal{V},1}, \ldots, x_{\mathcal{V},N} \}

• Given N independent, identically distributed, completely observed samples:

\log p(\mathcal{D} \mid \theta) = \sum_{n=1}^{N} \sum_{i \in \mathcal{V}} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) = \sum_{i \in \mathcal{V}} \Big[ \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) \Big]

Priors and Tied Parameters

\log p(\mathcal{D} \mid \theta) = \sum_{i \in \mathcal{V}} \Big[ \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) \Big]

\log p(\theta) = \sum_{i \in \mathcal{V}} \log p(\theta_i) \qquad \text{(a "meta-independent" factorized prior)}

• The factorized posterior allows independent learning for each node:

\log p(\theta \mid \mathcal{D}) = C + \sum_{i \in \mathcal{V}} \Big[ \log p(\theta_i) + \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_i) \Big]

• Learning remains tractable when subsets of nodes are "tied" to use identical, shared parameter values:

\log p(\mathcal{D} \mid \theta) = \sum_{i \in \mathcal{V}} \Big[ \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_{b_i}) \Big]

\log p(\theta_b \mid \mathcal{D}) = C + \log p(\theta_b) + \sum_{i : b_i = b} \sum_{n=1}^{N} \log p(x_{i,n} \mid x_{\Gamma(i),n}, \theta_b)

Example: Temporal Models
[Figure: three observation sequences of lengths 5, 3, and 4, each modeled as a hidden Markov chain with initial state z_0, hidden states z_{t,n}, and observations x_{t,n}.]

p(x, z \mid \theta) = \prod_{n=1}^{N} \prod_{t=1}^{T_n} p(z_{t,n} \mid z_{t-1,n}, \theta_{\text{time}}) \, p(x_{t,n} \mid z_{t,n}, \theta_{\text{obs}})

Learning Binary Probabilities

Bernoulli distribution: a single toss of a (possibly biased) coin,

\mathrm{Ber}(x \mid \theta) = \theta^{I(x=1)} (1-\theta)^{I(x=0)}, \qquad 0 \le \theta \le 1

• Suppose we observe N samples from a Bernoulli distribution with unknown mean:

X_i \sim \mathrm{Ber}(\theta), \quad i = 1, \ldots, N
p(x_1, \ldots, x_N \mid \theta) = \theta^{N_1} (1-\theta)^{N_0}

• What is the maximum likelihood parameter estimate?

\hat{\theta} = \arg\max_{\theta} \log p(x \mid \theta) = \frac{N_1}{N}
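Because the complete-data log-likelihood decomposes node by node, each conditional distribution can be estimated independently by simple counting. The sketch below assumes a hypothetical two-node binary model x1 -> x2 with synthetic data; it illustrates the decomposition, and is not code from the course.

```python
# Per-node maximum likelihood estimation from complete observations.
# The model (x1 -> x2, binary states) and the data-generating parameters are
# illustrative assumptions for this sketch.
import numpy as np

rng = np.random.default_rng(0)
N = 1000
x1 = rng.random(N) < 0.3                        # x1 ~ Ber(0.3)
x2 = rng.random(N) < np.where(x1, 0.9, 0.2)     # x2 | x1 ~ Ber(0.9) or Ber(0.2)

# MLE for theta_1 = p(x1 = 1): fraction of samples with x1 = 1, i.e. N1 / N.
theta1_hat = x1.mean()

# MLE for theta_2[a] = p(x2 = 1 | x1 = a): counts restricted to each parent value.
theta2_hat = np.array([x2[x1 == a].mean() for a in (0, 1)])

print(theta1_hat, theta2_hat)    # close to 0.3 and [0.2, 0.9] for large N
```

The per-node estimates are exactly the Bernoulli MLE \hat{\theta} = N_1 / N from the slide above, applied once to x1 and once to each parent configuration of x2.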
Beta Distributions

Probability density function, for x \in [0, 1]:

\mathrm{Beta}(x \mid a, b) = \frac{\Gamma(a+b)}{\Gamma(a)\,\Gamma(b)} \, x^{a-1} (1-x)^{b-1}

where the gamma function satisfies \Gamma(x+1) = x\,\Gamma(x) and \Gamma(k) = (k-1)! for integer k.

Moments and mode:

E[x] = \frac{a}{a+b}, \qquad V[x] = \frac{ab}{(a+b)^2 (a+b+1)}

\mathrm{Mode}[x] = \arg\max_{x \in [0,1]} \mathrm{Beta}(x \mid a, b) = \frac{a-1}{(a-1) + (b-1)}

Bayesian Learning of Probabilities

Bernoulli likelihood: a single toss of a (possibly biased) coin,

\mathrm{Ber}(x \mid \theta) = \theta^{I(x=1)} (1-\theta)^{I(x=0)}, \qquad 0 \le \theta \le 1
p(x_1, \ldots, x_N \mid \theta) = \theta^{N_1} (1-\theta)^{N_0}

Beta prior distribution:

p(\theta) = \mathrm{Beta}(\theta \mid a, b) \propto \theta^{a-1} (1-\theta)^{b-1}

Posterior distribution:

p(\theta \mid x) \propto \theta^{N_1 + a - 1} (1-\theta)^{N_0 + b - 1} \propto \mathrm{Beta}(\theta \mid N_1 + a, N_0 + b)

• This is a conjugate prior, because the posterior is in the same family as the prior
• Estimate \theta by the posterior mode (MAP) or by the posterior mean (preferred)
• Here, the posterior predictive distribution is equivalent to using the mean estimate
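A minimal sketch of the conjugate Beta-Bernoulli update above; the prior hyperparameters, the number of flips, and the true coin bias are illustrative assumptions.

```python
# Beta-Bernoulli conjugate update: posterior is Beta(N1 + a, N0 + b).
import numpy as np

rng = np.random.default_rng(1)
a, b = 2.0, 2.0                         # Beta(a, b) prior (hypothetical choice)
x = rng.random(50) < 0.7                # 50 flips of a coin with true theta = 0.7
N1, N0 = int(x.sum()), int((~x).sum())

post_a, post_b = N1 + a, N0 + b                       # posterior hyperparameters
theta_mean = post_a / (post_a + post_b)               # posterior mean (preferred)
theta_map = (post_a - 1) / (post_a + post_b - 2)      # posterior mode (MAP)
pred_heads = theta_mean                               # posterior predictive p(x_new = 1 | data)
print(N1, N0, theta_mean, theta_map, pred_heads)
```

With a = b = 1 (a uniform prior) the posterior mode reduces to the maximum likelihood estimate N_1 / N, while the posterior mean N_1 + 1 over N + 2 smooths toward 1/2.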