
PROBABILISTIC MODELS FOR STRUCTURED DATA

Lecture 04

Instructor: Yizhou Sun [email protected]

January 29, 2019

Content
• From Sequence to Graph: Markov Random Field
• Inference
  • VE
  • Belief Propagation
  • Loopy Belief Propagation
• Learning
  • Exponential Family
  • The Learning Framework
• Summary

From Sequence to Graph

• Dependencies exist among data points, and these dependencies form a graph
• Examples
  • Image semantic labeling (a regular graph)
  • User profiling in a social network (a general graph)

e.g., Friends tend to have the same voting preference: Democratic vs. Republican

Motivating Example
• Modeling voting preferences among persons
• Persons: A, B, C, D
• Each person can take a binary value
  • 1: Democratic
  • 0: Republican
• Friendships: (A,B), (B,C), (C,D), and (D,A)
• Friends tend to take the same value
  • Indicated by a factor $\phi(X, Y)$, which assigns a higher score to consistent votes among friends

• The joint probability
  • $P(A, B, C, D) = \frac{1}{Z}\,\phi(A,B)\,\phi(B,C)\,\phi(C,D)\,\phi(D,A)$
  • $Z = \sum_{A,B,C,D} \phi(A,B)\,\phi(B,C)\,\phi(C,D)\,\phi(D,A)$: normalization constant
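To make the factorization concrete, here is a minimal Python sketch of this example; the potential values (10 for agreement, 1 for disagreement) are assumed for illustration and are not from the slides.

```python
import itertools

# A minimal sketch of the voting example (hypothetical potential values):
# phi(x, y) scores agreement between two friends' votes (1 = Democratic, 0 = Republican).
def phi(x, y):
    return 10.0 if x == y else 1.0  # assumed scores; any positive numbers work

edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

def unnormalized(assign):
    score = 1.0
    for u, v in edges:
        score *= phi(assign[u], assign[v])
    return score

# Partition function Z: sum over all 2^4 = 16 joint configurations.
configs = [dict(zip("ABCD", vals)) for vals in itertools.product([0, 1], repeat=4)]
Z = sum(unnormalized(c) for c in configs)

# Joint probability of everyone voting Democratic.
p_all_dem = unnormalized({"A": 1, "B": 1, "C": 1, "D": 1}) / Z
print(Z, p_all_dem)
```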

Motivating Example (Cont.)

• Given the model, some interesting questions can be asked:
  • What is the most likely vote assignment, i.e., the one that maximizes the joint probability?
  • If we know A is Republican and C is Democratic, what are the most likely votes for B and D?
  • How can we learn the model parameters, i.e., the score assigned to each possible factor configuration?
    • E.g., $\beta_{11} = \phi(X = 1, Y = 1) = ?$

Formal Definition

• A Markov Random Field (MRF) is a distribution P over random variables $x_1, x_2, \ldots, x_n$ defined by an undirected graph G
• Gibbs distribution: $P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)$
  • $C$: cliques in graph G
  • $\phi_c(x_c)$: factor or potential function
  • $Z = \sum_{x_1, x_2, \ldots, x_n} \prod_{c \in C} \phi_c(x_c)$: partition function
• Log-linear form (if $\phi_c(x_c) > 0$)
  • $P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \exp\left(\sum_{c \in C} \log \phi_c(x_c)\right) = \frac{1}{Z} \exp\left(-\sum_{c \in C} E_c(x_c)\right)$
  • $E_c(x_c) = -\log \phi_c(x_c)$: energy function (the lower the better)
  • $U(x) = \sum_{c \in C} E_c(x_c)$: total energy

Graphical Representation of MRFs

• $G = (V, E)$
  • $V = \{1, 2, \ldots, N\}$: a set of random variables
  • $(i, j) \in E \iff \exists c: i \in c \text{ and } j \in c$ (i.e., $X_i$ and $X_j$ appear in the same factor)
  • Neighbors: $N(i) = \{j: (i, j) \in E\}$
• Example
  • $P(A, B, \ldots, H) \propto \phi_1(A,B,C)\,\phi_2(B,D,E)\,\phi_3(A,G)\,\phi_4(C,F)\,\phi_5(G,H)\,\phi_6(F,H)$

Conditional Independence in MRF

• Key properties
  • Global Markov property
    • For sets of nodes A, B, and C: $X_A \perp_G X_B \mid X_C$ iff C separates A from B in the graph
    • E.g., $X_{\{1,2\}} \perp X_{\{6,7\}} \mid X_{\{3,4,5\}}$
  • Local Markov property
    • A node is independent of the rest of the nodes in the graph, given its immediate neighbors (its Markov blanket)
    • E.g., $X_1 \perp X_{\{4,5,6,7\}} \mid X_{\{2,3\}}$
  • Pairwise Markov property
    • Two nodes in the network that are not directly connected are independent given all other nodes
    • E.g., $X_1 \perp X_7 \mid X_{\{2,3,4,5,6\}}$

Hammersley-Clifford Theorem

A strictly positive distribution $p(\boldsymbol{x}) > 0$ satisfies the conditional independence properties of an undirected graph G

if and only if

$p(\boldsymbol{x})$ can be factorized into a product of factors, one per maximal clique, i.e., $P(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)$

Examples of MRFs: Discrete MRFs

• Nodes are discrete random variables
• Factor or clique potentials: given a configuration of the random variables in a clique, assign a real number to it
  • E.g., for two variables A and B with values $a_1, a_2, a_3$ and $b_1, b_2$ in the same clique c, a possible potential function can be defined as a table assigning a real number to each of the $3 \times 2 = 6$ configurations
• Scope of a factor: the set of variables defining the factor, e.g., {A, B} in this case

Examples of MRFs: Gaussian MRFs

• Nodes are continuous random variables
• Precision matrix: $H = \Sigma^{-1}$
• Variables in $\boldsymbol{x}$ are connected in the network only if they have a nonzero entry in the precision matrix
• Potentials are defined over cliques of edges
  • For an edge $(i, j)$, the potential is defined as $\exp\left\{-\frac{1}{2} H_{ij} (x_i - \mu_i)(x_j - \mu_j)\right\}$
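As a quick illustration of the edge–precision correspondence, here is a minimal numpy sketch; the tridiagonal precision matrix below is an assumed example (a chain $x_1 - x_2 - x_3 - x_4$), not taken from the slides.

```python
import numpy as np

# Sketch: a chain-structured Gaussian MRF x1 - x2 - x3 - x4.
# The (hypothetical) precision matrix H is tridiagonal: only chain neighbors interact.
H = np.array([[ 2., -1.,  0.,  0.],
              [-1.,  2., -1.,  0.],
              [ 0., -1.,  2., -1.],
              [ 0.,  0., -1.,  2.]])

Sigma = np.linalg.inv(H)   # covariance: generally dense even though H is sparse
print(np.round(Sigma, 3))  # non-neighbors are marginally correlated...
print(np.round(np.linalg.inv(Sigma), 3))  # ...but H_ij = 0 encodes conditional independence
```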

Examples of MRFs: Pairwise MRF

• A very simple factorization
• Only considers factors over vertices and edges

• $p(x_1, x_2, \ldots, x_n) = \frac{1}{Z} \prod_{i \in V} \phi_i(x_i) \prod_{(i,j) \in E} \phi_{ij}(x_i, x_j)$
• E.g., Ising model: a mathematical model of ferromagnetism in statistical mechanics
  • Each atom takes one of two discrete values: $x_i \in \{+1, -1\}$
  • Edge potential: $\phi_{ij}(x_i, x_j) = \exp(w_{ij}\, x_i x_j)$

    • $w_{ij} > 0$: ferromagnetic
    • $w_{ij} < 0$: antiferromagnetic
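The following Python sketch spells out the Ising factorization on a toy 2×2 grid; the coupling value w and the grid itself are illustrative assumptions, not from the slides.

```python
import itertools
import math

# Sketch of a tiny Ising model on a 2x2 grid; spins take values in {+1, -1}.
# w > 0 (ferromagnetic) favors aligned neighbors; w < 0 would favor disagreement.
w = 0.5  # assumed coupling strength, shared by all edges
edges = [((0, 0), (0, 1)), ((1, 0), (1, 1)), ((0, 0), (1, 0)), ((0, 1), (1, 1))]
sites = [(0, 0), (0, 1), (1, 0), (1, 1)]

def unnormalized(spins):
    # exp(sum_{(i,j) in E} w * x_i * x_j)
    return math.exp(sum(w * spins[i] * spins[j] for i, j in edges))

Z = sum(unnormalized(dict(zip(sites, s)))
        for s in itertools.product([+1, -1], repeat=len(sites)))
aligned = unnormalized({s: +1 for s in sites})
print(aligned / Z)  # aligned configurations get higher probability when w > 0
```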

Content
• From Sequence to Graph: Markov Random Field
• Inference
  • VE
  • Belief Propagation
  • Loopy Belief Propagation
• Learning
  • Exponential Family
  • The Learning Framework
• Summary

Inference Problems

• Marginal distribution inference
  • What is the marginal probability $P(x_i)$?
    • E.g., what is the marginal probability that one person's voting preference is "Republican"?
  • What is the marginal probability of a random variable conditioned on some observed variables, e.g., $P(x_i \mid x_j = 1)$?
    • E.g., what is the marginal probability of one person's voting preference given some other people's observed preferences?
• Maximum a posteriori (MAP) inference
  • What is the most likely assignment to a set of random variables (possibly given some evidence)?
    • E.g., what are the most likely voting preferences for everyone in a social network?

Inference Methods

• Marginal inference
  • Variable elimination
  • Belief propagation: sum-product message passing
• MAP inference
  • Belief propagation: max-product message passing

An Illustrative Example

• Consider the marginal inference problem in a Markov chain
  • $P(x_1, \ldots, x_n) = P(x_1) \prod_{t=2}^{n} P(x_t \mid x_{t-1})$
  • Assume each variable takes d discrete values
• How do we compute the marginal probability of $x_n$?
  • $P(x_n) = \sum_{x_1} \cdots \sum_{x_{n-1}} P(x_1, \ldots, x_n) = \sum_{x_1} \cdots \sum_{x_{n-1}} P(x_1) \prod_{t=2}^{n} P(x_t \mid x_{t-1})$
  • Pushing the sums inward:
    $P(x_n) = \sum_{x_{n-1}} P(x_n \mid x_{n-1}) \sum_{x_{n-2}} P(x_{n-1} \mid x_{n-2}) \cdots \sum_{x_1} P(x_1)\, P(x_2 \mid x_1)$
  • Each inner sum yields an intermediary factor, starting from $\tau(x_2) = \sum_{x_1} P(x_1) P(x_2 \mid x_1)$ and ending with $\tau(x_{n-1})$
  • Familiar procedure? The cost drops from $O(d^n)$ to $O(n d^2)$ (cf. the forward algorithm for HMMs)
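A small numpy sketch of this computation, with randomly generated (hypothetical) distributions, comparing the pushed-in sums against naive enumeration:

```python
import itertools
import numpy as np

# Sketch: marginal P(x_n) in a Markov chain by pushing sums inward (O(n d^2)),
# checked against brute-force enumeration (O(d^n)). Distributions are random here.
rng = np.random.default_rng(0)
d, n = 3, 6
p1 = rng.dirichlet(np.ones(d))                                 # P(x_1)
T = [rng.dirichlet(np.ones(d), size=d) for _ in range(n - 1)]  # T[t][a, b] = P(x_{t+2}=b | x_{t+1}=a)

# Dynamic programming: tau plays the role of the intermediary factor.
tau = p1
for t in range(n - 1):
    tau = tau @ T[t]          # sum_{x_t} tau(x_t) P(x_{t+1} | x_t)
print(tau)                    # P(x_n)

# Brute force for comparison.
brute = np.zeros(d)
for xs in itertools.product(range(d), repeat=n):
    p = p1[xs[0]]
    for t in range(n - 1):
        p *= T[t][xs[t], xs[t + 1]]
    brute[xs[-1]] += p
print(np.allclose(tau, brute))  # True
```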

General VE

• Given an MRF (or another graphical model)

• $P(x_1, \ldots, x_n) = \frac{1}{Z} \prod_{c \in C} \phi_c(x_c)$
• Compute the unnormalized marginal, then normalize it after the computation
  • $\tilde{P}(x_1, \ldots, x_n) = \prod_{c \in C} \phi_c(x_c)$
  • Compute $\tilde{P}(x_i)$ by VE
  • Normalize the marginal: $P(x_i) = \frac{1}{Z} \tilde{P}(x_i)$

Factor Operations

• Factor product
  • $\phi_3 := \phi_1 \times \phi_2$
  • $\phi_3(x_c) = \phi_1(x_{c_1}) \times \phi_2(x_{c_2})$
  • The scope of $\phi_3$ is the union of the scopes of $\phi_1$ and $\phi_2$
  • $x_{c_i}$ denotes the assignment to the variables in the scope of $\phi_i$, obtained by restricting $x_c$ to that scope
  • E.g., $\phi_3(a, b, c) := \phi_1(a, b) \times \phi_2(b, c)$

Factor Operations (Cont.)

• Factor marginalization
  • Locally eliminates a set of variables from a factor
  • E.g., $\tau(x) = \sum_y \phi(x, y)$

(Figure: marginalizing out variable B from factor $\phi(A, B, C)$ to obtain $\tau(A, C) = \sum_B \phi(A, B, C)$)
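A minimal Python sketch of these two operations, representing a factor as a (scope, table) pair; the concrete potential values are assumed for illustration only.

```python
from itertools import product

# A minimal sketch of factors as tables: a factor is (scope, table), where
# scope is a tuple of variable names and table maps value-tuples to scores.
def factor_product(f1, f2):
    (s1, t1), (s2, t2) = f1, f2
    scope = s1 + tuple(v for v in s2 if v not in s1)
    # Possible values of each variable, read off from the factors' tables.
    domain = {v: sorted({key[s.index(v)] for (s, t) in (f1, f2) if v in s for key in t})
              for v in scope}
    table = {}
    for vals in product(*(domain[v] for v in scope)):
        assign = dict(zip(scope, vals))
        table[vals] = t1[tuple(assign[v] for v in s1)] * t2[tuple(assign[v] for v in s2)]
    return scope, table

def marginalize(factor, var):
    scope, table = factor
    new_scope = tuple(v for v in scope if v != var)
    new_table = {}
    for vals, score in table.items():
        key = tuple(v for v, name in zip(vals, scope) if name != var)
        new_table[key] = new_table.get(key, 0.0) + score
    return new_scope, new_table

# E.g., phi3(a, b, c) := phi1(a, b) * phi2(b, c), then tau(a, c) = sum_b phi3(a, b, c)
phi1 = (("A", "B"), {(a, b): 1.0 + (a == b) for a in (0, 1) for b in (0, 1)})
phi2 = (("B", "C"), {(b, c): 1.0 + (b == c) for b in (0, 1) for c in (0, 1)})
phi3 = factor_product(phi1, phi2)
print(marginalize(phi3, "B"))
```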

The Variable Elimination Algorithm

• Given an elimination order
  • Deciding the best order is an NP-hard problem
• The algorithm (see the sketch below): for each random variable $X_i$, following the given order,
  1. Multiply all factors $\phi$ containing $X_i$
  2. Marginalize out $X_i$ to obtain a new factor $\tau$
  3. Replace the factors containing $X_i$ by $\tau$
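Continuing the factor sketch above, a minimal version of this loop might look as follows; it reuses factor_product and marginalize and assumes an order in which every eliminated variable appears in at least one factor.

```python
# Continuing the sketch above (reuses factor_product, marginalize, phi1, phi2).
# Variable elimination for the unnormalized marginal over the remaining variables.
def variable_elimination(factors, order):
    factors = list(factors)
    for var in order:
        involved = [f for f in factors if var in f[0]]      # 1. factors containing X_i
        rest = [f for f in factors if var not in f[0]]
        prod = involved[0]
        for f in involved[1:]:
            prod = factor_product(prod, f)
        tau = marginalize(prod, var)                        # 2. marginalize out X_i
        factors = rest + [tau]                              # 3. replace them by tau
    result = factors[0]
    for f in factors[1:]:
        result = factor_product(result, f)
    return result

# Unnormalized P(C) for the chain A - B - C, eliminating A then B:
scope, table = variable_elimination([phi1, phi2], order=["A", "B"])
Z = sum(table.values())
print({k: v / Z for k, v in table.items()})  # normalized marginal P(C)
```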

Example of VE

• $P(A, B, \ldots, H) \propto \phi_1(A,B,C)\,\phi_2(B,D,E)\,\phi_3(A,G)\,\phi_4(C,F)\,\phi_5(G,H)\,\phi_6(F,H)$
• Compute $P(B)$
• Eliminate in the order E, D, H, F, G, C, A

(Figures: the intermediate factors produced at each elimination step)

Question

• What can we obtain by computing $\sum_B \tilde{P}(B)$?
  • (It is the partition function $Z$, which we can use to normalize the marginal: $P(B) = \tilde{P}(B) / Z$.)

Introducing Evidence

• What if some variables are observed?
  • $P(Y \mid E = e) = \frac{P(Y, E = e)}{P(E = e)}$
  • E.g., $P(B \mid A = a_1, C = c_2)$
• Computation flow
  • Perform variable elimination on $P(Y, E = e)$
    • For every factor that involves variables in E, fix their values to e
  • Perform variable elimination on $P(E = e)$

Running Time

• Time complexity: $O(m\, d^M)$
  • m: number of variables
  • d: number of states for each variable
  • M: maximum size of any factor during the elimination process
• E.g., the size of $\phi(A, B, C)$ is 3, and we need to go through all $3 \times 2 \times 2 = 12$ configurations to obtain $\tau(A, C)$

Elimination Orderings

• Finding the best ordering is an NP-hard problem
• Some useful heuristics
  • Min-neighbors: choose the variable with the fewest dependent variables
  • Min-weight: choose the variable that minimizes the product of the cardinalities of its dependent variables
  • Min-fill: choose the vertex that minimizes the number of edges that will be added to the graph (see the sketch below)
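A sketch of the min-fill heuristic, assuming the graph is given as an adjacency dictionary; the example graph is the one from the earlier VE example, and this is one reasonable implementation rather than the slides' own.

```python
# Sketch of the min-fill heuristic: repeatedly eliminate the vertex whose
# elimination adds the fewest fill-in edges (edges between its not-yet-connected neighbors).
def min_fill_order(adj):
    adj = {v: set(ns) for v, ns in adj.items()}
    order = []
    while adj:
        def fill_cost(v):
            nbrs = list(adj[v])
            return sum(1 for i in range(len(nbrs)) for j in range(i + 1, len(nbrs))
                       if nbrs[j] not in adj[nbrs[i]])
        v = min(adj, key=fill_cost)
        order.append(v)
        nbrs = list(adj[v])
        for i in range(len(nbrs)):           # connect v's neighbors to each other
            for j in range(i + 1, len(nbrs)):
                adj[nbrs[i]].add(nbrs[j])
                adj[nbrs[j]].add(nbrs[i])
        for u in nbrs:                       # remove v from the graph
            adj[u].discard(v)
        del adj[v]
    return order

# Graph from the earlier example: cliques {A,B,C}, {B,D,E}, {A,G}, {C,F}, {G,H}, {F,H}
graph = {"A": {"B", "C", "G"}, "B": {"A", "C", "D", "E"}, "C": {"A", "B", "F"},
         "D": {"B", "E"}, "E": {"B", "D"}, "F": {"C", "H"}, "G": {"A", "H"}, "H": {"F", "G"}}
print(min_fill_order(graph))
```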

Content
• From Sequence to Graph: Markov Random Field
• Inference
  • VE
  • Belief Propagation
  • Loopy Belief Propagation
• Learning
  • Exponential Family
  • The Learning Framework
• Summary

Belief Propagation

• Limitation of VE
  • Each run of VE can answer only one query
  • E.g., computing $P(Y_1)$ and $P(Y_2)$ needs two runs
• Can we share intermediate factors across computations?
  • Belief propagation: variable elimination as message passing

In the Case of Tree Structure

• Compute the marginal probability $p(x_i)$ in a tree structure
  • Tree: no cycles (a connected acyclic graph)
• The optimal order:
  • Set $x_i$ as the root
  • Traverse the nodes in postorder
    • Start from the leaf nodes; go up the tree only after a node's children have been visited (left, right, up)

Postorder: 4 5 2 3 1

In the Case of Tree Structure (Cont.)

• At each step, eliminate $x_j$ following the proposed order
  • Suppose the parent node of $x_j$ is $x_k$
  • $\tau_{jk}(x_k) = \sum_{x_j} \phi(x_k, x_j)\, \tau_j(x_j)$
  • The factor being marginalized has size 2 (scope $\{x_j, x_k\}$)
• Example: compute $p(x_3)$
  • Postorder: $x_2, x_1, x_4, x_5, x_3$
  • Eliminate $x_2$: $\tau_{21}(x_1) = \sum_{x_2} \phi(x_1, x_2)$
  • Eliminate $x_1$: $\tau_{13}(x_3) = \sum_{x_1} \phi(x_3, x_1) \times \tau_{21}(x_1)$
  • Eliminate $x_4$: $\tau_{43}(x_3) = \sum_{x_4} \phi(x_4, x_3)$
  • Eliminate $x_5$: $\tau_{53}(x_3) = \sum_{x_5} \phi(x_5, x_3)$
  • $p(x_3) \propto \tau_{13}(x_3) \times \tau_{43}(x_3) \times \tau_{53}(x_3)$

Message Passing View
• When $x_j$ is marginalized out, it receives all the signals from the variables underneath it in the tree
  • These can be summarized as a factor $\tau_j(x_j)$
  • $\tau_j(x_j)$ can be viewed as a message that $x_j$ sends to $x_k$
• $x_i$ (the root) receives messages from all its immediate children to obtain the final marginal
• What if we change the root of the tree, i.e., compute the marginal of a different variable?
  • Do we need to re-compute messages?

The Message-Passing Algorithm

• How do we compute all the messages we need?
• A node $x_i$ sends a message to a neighbor $x_j$ whenever it has received messages from all of its other neighbors (all nodes besides $x_j$)
• The algorithm finishes in $2|E|$ steps, where $|E|$ is the number of edges
  • Each edge receives messages twice: $x_i \rightarrow x_j$ and $x_j \rightarrow x_i$
• These messages are exactly the intermediate factors in the VE algorithm

Sum-Product Message Passing

• While there is a node $x_i$ ready to send a message to $x_j$ (meaning $x_i$ has received messages from all its other neighbors):
  • $m_{i \to j}(x_j) = \sum_{x_i} \phi(x_i)\, \phi(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i)$
  • This is the $\tau_{ij}(x_j)$ from the previous example
• The marginal probability can then be computed as
  • $p(x_i) \propto \phi(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)$
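A numpy sketch of sum-product message passing on the 5-node tree used in the earlier example; the node and edge potentials are random placeholders, not values from the slides.

```python
import numpy as np

# Sketch: sum-product message passing on the tree with edges 1-2, 1-3, 3-4, 3-5.
rng = np.random.default_rng(1)
d = 2
nodes = [1, 2, 3, 4, 5]
edges = [(1, 2), (1, 3), (3, 4), (3, 5)]
node_pot = {i: rng.random(d) + 0.1 for i in nodes}            # assumed phi(x_i)
edge_pot = {e: rng.random((d, d)) + 0.1 for e in edges}       # assumed phi(x_i, x_j)
nbrs = {i: [j for e in edges for j in e if i in e and j != i] for i in nodes}

def pot(i, j):   # edge potential as a matrix indexed by (x_i, x_j)
    return edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T

# Send each directed message once its dependencies are available (2|E| messages total).
msgs = {}
while len(msgs) < 2 * len(edges):
    for i in nodes:
        for j in nbrs[i]:
            ready = all((k, i) in msgs for k in nbrs[i] if k != j)
            if (i, j) not in msgs and ready:
                incoming = np.prod([msgs[(k, i)] for k in nbrs[i] if k != j], axis=0) \
                           if len(nbrs[i]) > 1 else np.ones(d)
                msgs[(i, j)] = pot(i, j).T @ (node_pot[i] * incoming)

# Marginal at any node: p(x_i) proportional to phi(x_i) times the incoming messages.
for i in nodes:
    b = node_pot[i] * np.prod([msgs[(k, i)] for k in nbrs[i]], axis=0)
    print(i, b / b.sum())
```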

Example of Sum-Product

• 8 messages: $m_{21}(x_1)$, $m_{13}(x_3)$, $m_{43}(x_3)$, $m_{53}(x_3)$, $m_{31}(x_1)$, $m_{12}(x_2)$, $m_{34}(x_4)$, $m_{35}(x_5)$
• Marginal probability
  • E.g., $p(x_1) \propto \phi(x_1) \times m_{31}(x_1) \times m_{21}(x_1)$

Max-Product Message Passing

• Now let's consider MAP inference
  • $\boldsymbol{x}^* = \arg\max_{x_1, \ldots, x_n} p(x_1, \ldots, x_n)$ (find the most probable assignment)
• In a Markov chain model (which can be viewed as a chain MRF)
  • Replace the previous sum with max
  • $\tilde{p}^* = \max_{x_1} \cdots \max_{x_n} \phi(x_1) \prod_{i=2}^{n} \phi(x_i, x_{i-1}) = \max_{x_n} \max_{x_{n-1}} \phi(x_n, x_{n-1}) \max_{x_{n-2}} \phi(x_{n-1}, x_{n-2}) \cdots \max_{x_1} \phi(x_2, x_1)\, \phi(x_1)$
  • Keep back-pointers to the best assignment of $x_i$ for each assignment of $x_{i+1}$
    • E.g., keep a back-pointer to the best assignment of $x_1$ for each assignment of $x_2$
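A numpy sketch of this max-product, Viterbi-style computation on a chain; the node and edge potentials are assumed random values for illustration.

```python
import numpy as np

# Sketch: max-product MAP inference on a chain MRF with node potentials phi_i(x_i)
# and edge potentials phi(x_t, x_{t+1}) (random placeholders here).
rng = np.random.default_rng(2)
d, n = 3, 5
node_pot = rng.random((n, d)) + 0.1
edge_pot = rng.random((n - 1, d, d)) + 0.1   # edge_pot[t, a, b] = phi(x_t = a, x_{t+1} = b)

# Forward pass: m[t, b] = best unnormalized score of x_1..x_t ending with x_t = b.
m = np.zeros((n, d))
back = np.zeros((n, d), dtype=int)           # back-pointers to the best previous state
m[0] = node_pot[0]
for t in range(1, n):
    scores = m[t - 1][:, None] * edge_pot[t - 1] * node_pot[t][None, :]
    back[t] = scores.argmax(axis=0)
    m[t] = scores.max(axis=0)

# Backtracking from the best final state recovers the MAP assignment.
x = [int(m[-1].argmax())]
for t in range(n - 1, 0, -1):
    x.append(int(back[t][x[-1]]))
print(list(reversed(x)))
```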

Max-Product Message Passing (Cont.)

• In a tree MRF model
  • Replace sum with max in sum-product message passing
  • $m_{i \to j}(x_j) = \max_{x_i} \phi(x_i)\, \phi(x_i, x_j) \prod_{k \in N(i) \setminus j} m_{k \to i}(x_i)$
  • Max-marginal: $p_m(x_i) \propto \phi(x_i) \prod_{k \in N(i)} m_{k \to i}(x_i)$
• Backtracking
  • Keep a back-pointer to the best assignment of $x_i$ for each assignment of $x_j$
  • From the max-marginals, a mode (MAP assignment) can be determined by backtracking

This generalizes the Viterbi algorithm!

For a General Graph Structure

• Consider an MRF with pairwise potentials (pairwise MRF)

• $p(x_1, x_2, \ldots, x_n) \propto \prod_{i} \phi(x_i) \prod_{(i,j) \in E} \phi(x_i, x_j)$
• Loopy belief propagation (an approximate algorithm)
  • Given an edge order, perform message passing iteratively
  • $m^{t+1}_{i \to j}(x_j) = \sum_{x_i} \phi(x_i)\, \phi(x_i, x_j) \prod_{k \in N(i) \setminus j} m^{t}_{k \to i}(x_i)$
  • Convergence is not guaranteed; if it converges, the result is usually good
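A sketch of loopy BP on a small pairwise MRF with a cycle; the potentials are random placeholders, messages are normalized after each update for numerical stability, and the sweep count is fixed rather than checked for convergence.

```python
import numpy as np

# Sketch of loopy belief propagation on a 4-cycle pairwise MRF (assumed potentials).
rng = np.random.default_rng(3)
d = 2
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]          # a 4-cycle, like the voting example
node_pot = {i: rng.random(d) + 0.1 for i in range(4)}
edge_pot = {e: rng.random((d, d)) + 0.1 for e in edges}
nbrs = {i: [j for e in edges for j in e if i in e and j != i] for i in range(4)}
pot = lambda i, j: edge_pot[(i, j)] if (i, j) in edge_pot else edge_pot[(j, i)].T

msgs = {(i, j): np.ones(d) / d for i in range(4) for j in nbrs[i]}
for _ in range(50):                               # fixed number of sweeps
    new = {}
    for (i, j) in msgs:
        incoming = np.ones(d)
        for k in nbrs[i]:
            if k != j:
                incoming *= msgs[(k, i)]
        m = pot(i, j).T @ (node_pot[i] * incoming)
        new[(i, j)] = m / m.sum()                 # normalize the message
    msgs = new

for i in range(4):                                # approximate marginals (beliefs)
    b = node_pot[i].copy()
    for k in nbrs[i]:
        b *= msgs[(k, i)]
    print(i, b / b.sum())
```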

Other Inference Methods

• Junction tree
• Mean field
• Sampling

Content
• From Sequence to Graph: Markov Random Field
• Inference
  • VE
  • Belief Propagation
  • Loopy Belief Propagation
• Learning
  • Exponential Family
  • The Learning Framework
• Summary

Exponential Family

• Canonical form
  • $p(y; \eta) = b(y) \exp\left(\eta^T T(y) - a(\eta)\right)$
  • $\eta$: natural parameter
  • $T(y)$: sufficient statistic
  • $a(\eta)$: log partition function, for normalization
    • $a(\eta) = \log \sum_y b(y) \exp(\eta^T T(y))$ (discrete case)
  • $b(y)$: a function that depends only on $y$

Examples of Exponential Family

• $p(y; \eta) = b(y) \exp\left(\eta^T T(y) - a(\eta)\right)$
• Many distributions belong to this family: Gaussian, Bernoulli, Poisson, beta, Dirichlet, categorical, …
• For Gaussian (when we are not interested in $\sigma$): $\eta = \mu$, $T(y) = y$
• For Bernoulli: $\eta$ is the log-odds, $T(y) = y$
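The derivations shown on the original slides did not survive the text extraction; the standard forms they refer to are reconstructed below.

```latex
% Bernoulli with mean \varphi:
p(y;\varphi) = \varphi^{y}(1-\varphi)^{1-y}
             = \exp\!\left( y \log\tfrac{\varphi}{1-\varphi} + \log(1-\varphi) \right),
\qquad b(y)=1,\quad T(y)=y,\quad \eta=\log\tfrac{\varphi}{1-\varphi},\quad a(\eta)=\log(1+e^{\eta}).

% Gaussian with unit variance (\sigma not of interest):
p(y;\mu) = \tfrac{1}{\sqrt{2\pi}}\, e^{-y^{2}/2}\,\exp\!\left( \mu y - \tfrac{\mu^{2}}{2} \right),
\qquad b(y)=\tfrac{1}{\sqrt{2\pi}}\, e^{-y^{2}/2},\quad T(y)=y,\quad \eta=\mu,\quad a(\eta)=\tfrac{\eta^{2}}{2}.
```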

Properties of Exponential Family

• $\log p(y; \eta)$ is concave with respect to $\eta$
  • If a local maximum exists, then it is a global maximum
• Concavity proof
  • $\log p(y; \eta) = \log b(y) + \eta^T T(y) - a(\eta)$
  • $\nabla_\eta \log p(y; \eta) = T(y) - \nabla_\eta a(\eta) = T(y) - E_{y \sim p}[T(y)]$
  • $\nabla^2_\eta \log p(y; \eta) = -\mathrm{cov}(T(y))$
  • A covariance matrix is always positive semi-definite
  • Therefore, the second derivative (Hessian) is always negative semi-definite, which implies that the log-likelihood function is concave with respect to $\eta$

The Learning Problem in MRF

• Consider an MRF with parameters $\beta$
  • $p(x_1, \ldots, x_n; \beta) = \frac{1}{Z(\beta)} \prod_{c \in C} \phi_c(x_c; \beta)$
• The goal: find $\beta$ that maximizes the likelihood function
  • Usually over multiple observed configurations: $D = \{\boldsymbol{x}^1, \boldsymbol{x}^2, \ldots, \boldsymbol{x}^N\}$
• Computational challenge of the gradient ascent method
  • The gradient is intractable to compute (it involves the partition function $Z(\beta)$)

Reparametrizing

• $p(x_1, \ldots, x_n; \beta) = \frac{1}{Z(\beta)} \exp\left(\sum_{c \in C} \log \phi_c(x_c; \beta)\right) = \frac{1}{Z(\beta)} \exp\left(\sum_{c \in C} \sum_{x'_c} 1\{x_c = x'_c\} \log \phi_c(x'_c; \beta)\right) = \frac{\exp\left(\boldsymbol{\theta}^T \boldsymbol{f}(\boldsymbol{x})\right)}{Z(\boldsymbol{\theta})}$
  • A special case of the exponential family, where $b(x) = 1$
  • $x'_c$: a possible assignment/configuration of clique c
  • $1\{x_c = x'_c\}$: indicator function, 1 if $x_c = x'_c$ holds and 0 otherwise
  • $\boldsymbol{f}(\boldsymbol{x})$: a vector of indicator functions
    • Size: $\sum_c (\#\text{ of configurations for clique } c)$
  • $\boldsymbol{\theta}$: the set of all parameters, defined by $\log \phi_c(x'_c; \beta)$

In a Pairwise MRF Case

• Consider a pairwise MRF in which each variable takes possible values from $\{1, 2, \ldots, M\}$, and every edge shares the same type of factor function (shared parameters)
  • $p(x_1, x_2, \ldots, x_n) \propto \prod_{(i,j) \in E} \phi(x_i, x_j) = \exp\left(\sum_{(i,j) \in E} \log \phi(x_i, x_j)\right)$
  • $= \exp\left(\sum_{(i,j) \in E} \sum_{m,n} 1\{x_i = m, x_j = n\} \log \phi(m, n)\right) = \exp\left(\sum_{m,n} \theta_{mn}\, f_{mn}(\boldsymbol{x})\right)$
  • $f_{mn}(\boldsymbol{x})$: the number of edges with configuration $(m, n)$
  • $\theta_{mn} = \log \phi(m, n)$
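A small Python sketch of the sufficient statistics $f_{mn}(\boldsymbol{x})$ for the voting-style example; the parameter values and the observed configuration are assumed for illustration.

```python
import numpy as np

# Sketch: sufficient statistics for a shared-parameter pairwise MRF.
# f[m, n] counts the edges whose endpoints take the configuration (m, n),
# so log p(x) = sum_{m,n} theta[m, n] * f[m, n] - log Z(theta).
M = 2
theta = np.log(np.array([[10., 1.], [1., 10.]]))   # assumed: agreement scored 10, disagreement 1
edges = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "A")]

def suff_stats(assign):
    f = np.zeros((M, M))
    for u, v in edges:
        f[assign[u], assign[v]] += 1
    return f

x = {"A": 1, "B": 1, "C": 0, "D": 1}               # a hypothetical joint configuration
f = suff_stats(x)
print(f)                                           # edge-configuration counts
print(np.sum(theta * f))                           # unnormalized log-score theta^T f(x)
```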

Example

• Consider a pairwise MRF where each site can take only two values {0, 1}
• For each pair, there are 4 possible configurations: 00, 01, 10, 11
• Parameter matrix:
  $\begin{pmatrix} \beta_{00} & \beta_{01} \\ \beta_{10} & \beta_{11} \end{pmatrix} = \begin{pmatrix} \exp(\theta_{00}) & \exp(\theta_{01}) \\ \exp(\theta_{10}) & \exp(\theta_{11}) \end{pmatrix}$
• E.g., $\phi(0, 0) = \beta_{00} = \exp(\theta_{00})$

Properties under the New Form

• $\nabla_{\boldsymbol{\theta}} \log p(\boldsymbol{x}; \boldsymbol{\theta}) = \boldsymbol{f}(\boldsymbol{x}) - \nabla_{\boldsymbol{\theta}} \log Z(\boldsymbol{\theta}) = \boldsymbol{f}(\boldsymbol{x}) - E_{\boldsymbol{x} \sim p}[\boldsymbol{f}(\boldsymbol{x})]$
• The first derivative is expensive to compute
  • Remember that $E_{\boldsymbol{x} \sim p}[\boldsymbol{f}(\boldsymbol{x})] = \sum_{\boldsymbol{x}} p(\boldsymbol{x}) \boldsymbol{f}(\boldsymbol{x})$ involves enumerating all possible $\boldsymbol{x}$
• $\nabla^2_{\boldsymbol{\theta}} \log p(\boldsymbol{x}; \boldsymbol{\theta}) = -\mathrm{cov}(\boldsymbol{f}(\boldsymbol{x}))$
• Concave
  • Any local optimum is a global optimum

Approximate Learning

• Pseudo-likelihood

• Gibbs sampling-based gradient ascent*
  • Gibbs sampling is one of the Markov chain Monte Carlo (MCMC) methods

Pseudo-Likelihood

• Approximate the likelihood with the pseudo-likelihood
  • $\log p(\boldsymbol{x}; \boldsymbol{\theta}) \approx \sum_i \log p(x_i \mid x_{N(i)}; \boldsymbol{\theta})$
  • $N(i)$ is the set of neighbors of $i$ in the graph
• The pseudo-likelihood is not equal to the likelihood
  • Applying the chain rule gives the exact factorization: $\log p(\boldsymbol{x}; \boldsymbol{\theta}) = \log p(x_1) + \sum_{i=2}^{n} \log p(x_i \mid x_1, \ldots, x_{i-1})$
• The pseudo-likelihood is tractable (each conditional needs only a local normalization; see the sketch below)
• The pseudo-likelihood is concave
  • A sum of concave functions is concave
• Works well in practice

More on the Conditional Distribution: Pairwise MRF Setting

• $p(x_i \mid x_{N(i)}; \boldsymbol{\theta}) = p(x_i \mid x_{-i}; \boldsymbol{\theta})$ (local Markov property)
  $= \frac{p(\boldsymbol{x})}{p(x_{-i})} = \frac{\prod_{(i,j) \in E} \phi(x_i, x_j) \prod_{(k,j) \in E,\, k, j \neq i} \phi(x_k, x_j)}{\sum_{x'_i} \prod_{(i,j) \in E} \phi(x'_i, x_j) \prod_{(k,j) \in E,\, k, j \neq i} \phi(x_k, x_j)}$
  $= \frac{\prod_{j \in N(i)} \phi(x_i, x_j)}{\sum_{x'_i} \prod_{j \in N(i)} \phi(x'_i, x_j)}$
  $= \frac{\exp\left(\sum_{j \in N(i)} \sum_{m,n} 1\{x_i = m, x_j = n\}\, \theta_{mn}\right)}{Z_i(\boldsymbol{\theta})}$
  • $Z_i(\boldsymbol{\theta})$: a local normalization constant, summing only over the values of $x_i$
• Gradient of the conditional log-likelihood:
  $\frac{\partial \log p(x_i \mid x_{N(i)}; \boldsymbol{\theta})}{\partial \theta_{mn}} = \sum_{j \in N(i)} 1\{x_i = m, x_j = n\} - E_{x_i \mid x_{N(i)}}\left[\sum_{j \in N(i)} 1\{x_i = m, x_j = n\}\right]$

Content
• From Sequence to Graph: Markov Random Field
• Inference
  • VE
  • Belief Propagation
  • Loopy Belief Propagation
• Learning
  • Exponential Family
  • The Learning Framework
• Summary

Summary
• From Sequence to Graph: Markov Random Field
• Inference
  • VE
  • Belief Propagation
    • Sum-product message passing; max-product message passing
  • Loopy Belief Propagation
    • For general graphs
• Learning
  • Exponential Family
    • MRF belongs to the exponential family
  • The Learning Framework
    • Pseudo-likelihood

References

• Stanford cs228 course notes: https://ermongroup.github.io/cs228-notes/
• Kevin P. Murphy, Machine Learning: A Probabilistic Perspective, 2012.
• Martin Wainwright, Graphical Models, Message-Passing, and Variational Methods: https://people.eecs.berkeley.edu/~wainwrig/Talks/Wainwright_PartI.pdf
• CMU 10-708 course, Probabilistic Graphical Models: https://www.cs.cmu.edu/~epxing/Class/10708-14/scribe_notes/scribe_note_lecture13.pdf
• J. Coughlan, belief propagation tutorial: http://computerrobotvision.org/2009/tutorial_day/crv09_belief_propagation_v2.pdf
• Amir Globerson, MIT course notes: http://people.csail.mit.edu/dsontag/courses/pgm12/slides/pseudolikelihood_notes.pdf
