CS242: Probabilistic Graphical Models Lecture 1B: Factor Graphs, Inference, & Variable Elimination

Professor Erik Sudderth Brown University Computer Science September 15, 2016

Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

CS242: Lecture 1B Outline
Ø Factor graphs
Ø Loss Functions and Marginal Distributions
Ø Variable Elimination Algorithms

Factor Graphs
V: set of N nodes or vertices, {1, 2, ..., N}
F: set of hyperedges linking subsets of nodes, f ⊆ V
p(x) = (1/Z) ∏_{f∈F} ψ_f(x_f),    Z = Σ_x ∏_{f∈F} ψ_f(x_f)
Z > 0: normalization constant (partition function)
ψ_f(x_f) ≥ 0: arbitrary non-negative potential function
• In a hypergraph, the hyperedges link arbitrary subsets of nodes (not just pairs)
• Visualize by a bipartite graph, with square (usually black) nodes for hyperedges
• Motivation: factorization key to computation
• Motivation: avoid ambiguities of undirected graphical models

Factor Graphs & Factorization
For a given undirected graph, there exist distributions with equivalent Markov properties, but different factorizations and different inference/learning complexities.

p(x) = (1/Z) ∏_{f∈F} ψ_f(x_f)

(Four panels: an undirected graph, its pairwise potential factorization, its maximal clique potential factorization, and an alternative factorization.)

Pairwise (edge) potentials:    p(x) = (1/Z) ∏_{(s,t)∈E} ψ_st(x_s, x_t)
Maximal clique potentials:     p(x) = (1/Z) ∏_{c∈C} ψ_c(x_c)

Recommendation: Use undirected graphs for pairwise MRFs, and factor graphs for higher-order potentials.
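To make the definitions above concrete, here is a minimal Python/NumPy sketch (not from the slides) that evaluates the partition function Z and a normalized joint probability for a toy factor graph over three binary variables; the potential tables psi_12 and psi_23 are made up for illustration.

```python
import itertools
import numpy as np

# Toy factor graph over three binary variables x1, x2, x3 with two factors,
# psi_12(x1, x2) and psi_23(x2, x3).  The tables below are illustrative.
psi_12 = np.array([[4.0, 1.0],
                   [1.0, 4.0]])   # favors x1 == x2
psi_23 = np.array([[2.0, 1.0],
                   [1.0, 2.0]])   # favors x2 == x3

def unnormalized(x1, x2, x3):
    """Product of all factor potentials at one joint configuration."""
    return psi_12[x1, x2] * psi_23[x2, x3]

# Partition function Z: sum the unnormalized product over all joint states.
Z = sum(unnormalized(*x) for x in itertools.product((0, 1), repeat=3))

# Normalized joint probability of one configuration, p(x) = prod_f psi_f / Z.
print(Z, unnormalized(0, 0, 0) / Z)   # 30.0  0.2666...
```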

Example: Nearest-Neighbor Spatial MRF
Stereo Vision: Yamaguchi, Hazan, McAllester, Urtasun, ECCV 2012
Table 2 (Yamaguchi et al.): Comparison with the state of the art on the test set of KITTI [2].

Method            >2 pixels            >3 pixels            >4 pixels            >5 pixels
                  Non-Occ    All       Non-Occ    All       Non-Occ    All       Non-Occ    All
GC+occ [36]       39.76 %   40.97 %    33.50 %   34.74 %    29.86 %   31.10 %    27.39 %   28.61 %
OCV-BM [37]       27.59 %   28.97 %    25.39 %   26.72 %    24.06 %   25.32 %    22.94 %   24.14 %
CostFilter [38]   25.85 %   27.05 %    19.96 %   21.05 %    17.12 %   18.10 %    15.51 %   16.40 %
GCS [26]          18.99 %   20.30 %    13.37 %   14.54 %    10.40 %   11.44 %     8.63 %    9.55 %
GCSF [39]         20.75 %   22.69 %    13.02 %   14.77 %     9.48 %   11.02 %     7.48 %    8.84 %
SDM [27]          15.29 %   16.65 %    10.98 %   12.19 %     8.81 %    9.87 %     7.44 %    8.39 %
ELAS [30]         10.95 %   12.82 %     8.24 %    9.95 %     6.72 %    8.22 %     5.64 %    6.94 %
OCV-SGBM [25]     10.58 %   12.20 %     7.64 %    9.13 %     6.04 %    7.40 %     5.04 %    6.25 %
ITGV [40]          8.86 %   10.20 %     6.31 %    7.40 %     5.06 %    5.97 %     4.26 %    5.01 %
Ours               6.25 %    7.78 %     4.13 %    5.45 %     3.18 %    4.32 %     2.66 %    3.66 %

Figure (stereo slide): Fig. 4 from Yamaguchi et al., KITTI examples. (Left) Original. (Middle) Disparity. (Right) Disparity errors.

Example: Object Segmentation
Batra et al., "Diverse M-Best Solutions in Markov Random Fields," ECCV 2012

• Observed nodes: Features of 2D image (intensity, color, texture, …)
• Hidden nodes: Property of 3D world (depth, motion, object category, …)

Fig. 1 (Batra et al.): Examples of category-level segmentation results on test images from VOC 2010. For each image, the "mode" shown is the best of 10 solutions obtained with DivMBest.
Fig. 2 (Batra et al.): (Left) Interactive segmentation setup; (Right) Pose tracking. "Mode" refers to the solution obtained with DivMBest. The goal is a diverse set of highly probable solutions: even if the MAP solution alone is of poor quality, a diverse set of highly probable hypotheses might still enable accurate predictions.
(Yamaguchi et al. also use higher-order potentials for occlusion reasoning: a compatibility potential penalizing occlusion boundaries not supported by the disparity observations, a penalty on negative disparities, a regularization preferring simpler boundary types with occ > hinge > 0, and junction feasibility terms.)

Higher-Order MRF Potentials
Higher-order models in Computer Vision, Kohli & Rother 2012

• Observed nodes: Features of 2D image (intensity, color, texture, …)
• Hidden nodes: Property of 3D world (depth, motion, object category, …)

Figure 1.2 (Kohli & Rother): Some qualitative results. Please view in colour. First Row: Original Image. Second Row: Unary likelihood labeling from TextonBoost [34]. Third Row: Result obtained using a pairwise contrast preserving smoothness potential as described in [34]. Fourth Row: Result obtained using the Pn Potts model potential [15]. Fifth Row: Results using the Robust Pn model potential (1.7) with truncation parameter Q = 0.1|c|, where |c| is equal to the size of the super-pixel over which the Robust Pn higher-order potential is defined. Sixth Row: Hand labeled segmentations. The ground truth segmentations are not perfect and many pixels (marked black) are unlabelled. Observe that the Robust Pn model gives best results. For instance, the leg of the sheep and bird have been accurately labeled, which was missing in the other results.

Unlike the standard Pn Potts model, this potential function gives rise to a cost that is a linear truncated function of the number of inconsistent variables (see figure 1.1). This enables the robust potential to allow some variables in the clique to take different labels.

Low Density Parity Check (LDPC) Codes
Ø Each variable node is binary, so x_s ∈ {0, 1}
Ø Parity check factors equal 1 if the sum of the connected bits is even, 0 if the sum is odd (invalid codewords are excluded)
Ø Evidence (observation) factors: unary evidence factors equal the probability that each bit is a 0 or 1, given data. Assumes independent "noise" on each bit. (See the sketch below.)
(Factor graph: data bits and parity bits are the variable nodes, linked by parity check factors.)
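As a concrete illustration of these factor definitions, here is a small Python sketch (an illustration, not an actual decoder): a hypothetical 6-bit code with three parity checks, indicator parity factors, and unary evidence factors under an independent bit-flip noise model with flip probability 0.08.

```python
# Hypothetical 6-bit code with three parity checks (not a standardized LDPC
# code); each tuple lists the bit indices joined by one parity check factor.
checks = [(0, 1, 2), (1, 3, 4), (2, 4, 5)]

def parity_factor(bits):
    """Parity check factor: 1 if the connected bits sum to even, else 0."""
    return 1.0 if sum(bits) % 2 == 0 else 0.0

def evidence_factor(bit, received_bit, flip_prob=0.08):
    """Unary evidence factor: probability of the received bit given the true
    bit, assuming independent bit-flip noise with the stated probability."""
    return 1.0 - flip_prob if bit == received_bit else flip_prob

def unnormalized_posterior(codeword, received):
    """Product of all parity and evidence factors for a candidate codeword;
    the product is zero for any invalid codeword."""
    value = 1.0
    for check in checks:
        value *= parity_factor([codeword[i] for i in check])
    for i, r in enumerate(received):
        value *= evidence_factor(codeword[i], r)
    return value

codeword = [1, 1, 0, 1, 0, 0]        # satisfies all three parity checks
received = [1, 1, 0, 0, 0, 0]        # one bit flipped by the channel
print(unnormalized_posterior(codeword, received))   # (0.92 ** 5) * 0.08
```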

Inference: Figure 3 shows decoding of a (3,6) regular LDPC code, given a binary input message (top half) corrupted by noise with error probability ε = 0.08 (as set up in [5]). Panels show the received codeword including parity bits (zero iterations, left), the bits that maximize the posterior marginals after five and fifteen BP iterations, and the final, correctly decoded signal after convergence to a loopy BP fixed point (the Stata center, home of the MIT Computer Science & Artificial Intelligence Laboratory).

The example uses systematic message encoding [5], in which the first half of the transmitted bits are the uncoded message of interest. To aid visualization, we choose this message to be a simple binary image, but the LDPC code does not model or exploit this spatial structure. To decode a noise-corrupted message via loopy BP, we pass messages in a parallel fashion. Each iteration begins by sending new messages from each variable node to all neighboring factor nodes, and then updates the outgoing messages from each parity check. As illustrated in Fig. 3, this allows information from un-violated parity checks to propagate throughout the message, gradually correcting noisy bits. These BP message updates continue until the thresholded marginal probabilities define a valid codeword, or until some maximum number of iterations is reached. In practice, BP decoding typically fails due to oscillations in the message update dynamics, rather than convergence to an incorrect codeword [5].

The outstanding empirical performance of message-passing decoders has inspired extensive research on alternatives to regular LDPC codes [5]. One widely used trick removes all cycles of length four from the factor graph, so that no two parity checks share a common pair of message bits. This reduces over-counting that can otherwise hamper loopy BP. In applications where transmitting million-bit blocks is feasible, extremely effective codes can be found by designing random ensembles of sparse parity check matrices whose typical members are good. For more moderate blocklengths, graphical code design is a topic of ongoing research.

Directed Graphs as Factor Graphs
Directed Graphical Model:
p(x) = ∏_{i=1}^N p(x_i | x_{Γ(i)})        (Γ(i) denotes the parents of node i)
Corresponding Factor Graph:

p(x) = ∏_{i=1}^N ψ_i(x_i, x_{Γ(i)})

• Associate one factor with each node, linking it to its parents and defined to equal the corresponding conditional distribution
• Information lost: Directionality of conditional distributions, and fact that global partition function Z = 1

Directed Graphs as Undirected Graphs
p(x) = ∏_{i=1}^N p(x_i | x_{Γ(i)})    becomes    p(x) = ∏_{i=1}^N ψ_i(x_i, x_{Γ(i)})
• Retain all directed edges, but drop directionality
• Create moral graph by linking ("marrying") pairs of nodes with a common child (see the sketch below)
• This undirected graph has a clique for the conditional of each node given its parents

Figure 2.4. Three graphical representations of a distribution over five random variables (see [175]). (a) Directed graph G depicting a causal, generative process. (b) Factor graph expressing the factorization underlying G. (c) A "moralized" undirected graph capturing the Markov structure of G.
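A minimal sketch of the moralization step in plain Python; the directed graph used here is illustrative and not necessarily the exact structure drawn in Figure 2.4.

```python
def moralize(parents):
    """Undirected moral graph of a directed model: keep every edge (without
    direction) and "marry" all pairs of parents that share a child."""
    edges = set()
    for child, pa in parents.items():
        for p in pa:                         # drop directionality
            edges.add(frozenset((child, p)))
        for i, p in enumerate(pa):           # link co-parents of each child
            for q in pa[i + 1:]:
                edges.add(frozenset((p, q)))
    return edges

# Illustrative directed graph over x1..x5 (an assumed structure, chosen for
# the example): x1 -> x3, x2 -> x3, x3 -> x4, x3 -> x5.
parents = {"x1": [], "x2": [], "x3": ["x1", "x2"], "x4": ["x3"], "x5": ["x3"]}
print(sorted(tuple(sorted(e)) for e in moralize(parents)))
# [('x1', 'x2'), ('x1', 'x3'), ('x2', 'x3'), ('x3', 'x4'), ('x3', 'x5')]
```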

For example, in the factor graph of Fig. 2.5(c), there are 5 variable nodes, and the joint distribution has one potential for each of the 3 hyperedges:
p(x) ∝ ψ_123(x_1, x_2, x_3) ψ_234(x_2, x_3, x_4) ψ_35(x_3, x_5)
Often, these potentials can be interpreted as local dependencies or constraints. Note, however, that ψ_f(x_f) does not typically correspond to the marginal distribution p_f(x_f), due to interactions with the graph's other potentials. In many applications, factor graphs are used to impose structure on an exponential family of densities. In particular, suppose that each potential function is described by the following unnormalized exponential form:

ψ_f(x_f | θ_f) = ν_f(x_f) exp{ Σ_{a∈A_f} θ_{fa} φ_{fa}(x_f) }        (2.67)

Here, θ_f ≜ {θ_{fa} | a ∈ A_f} are the canonical parameters of the local exponential family for hyperedge f. From eq. (2.66), the joint distribution can then be written as

p(x | θ) = [ ∏_{f∈F} ν_f(x_f) ] exp{ Σ_{f∈F} Σ_{a∈A_f} θ_{fa} φ_{fa}(x_f) − Φ(θ) }        (2.68)

Comparing to eq. (2.1), we see that factor graphs define regular exponential families [104, 311], with parameters θ = {θ_f | f ∈ F}, whenever local potentials are chosen from such families. The results of Sec. 2.1 then show that local statistics, computed over the support of each hyperedge, are sufficient for learning from training data.


Example – Part I: Directed Graphs and Evidence
Sally's burglar Alarm is sounding. Has she been Burgled, or was the alarm triggered by an Earthquake? She turns the car Radio on for news of earthquakes.

Choosing an ordering: Without loss of generality, we can write
p(A, R, E, B) = p(A | R, E, B) p(R, E, B)
              = p(A | R, E, B) p(R | E, B) p(E, B)
              = p(A | R, E, B) p(R | E, B) p(E | B) p(B)

Assumptions:
The alarm is not directly influenced by any report on the radio: p(A | R, E, B) = p(A | E, B)
The radio broadcast is not directly influenced by the burglar variable: p(R | E, B) = p(R | E)
Burglaries don't directly 'cause' earthquakes: p(E | B) = p(E)
Therefore p(A, R, E, B) = p(A | E, B) p(R | E) p(E) p(B)

Example – Part II: Specifying the Tables
(Directed graph: B → A ← E, with E → R)

p(A | B, E):
Alarm = 1    Burglar    Earthquake
0.9999       1          1
0.99         1          0
0.99         0          1
0.0001       0          0

p(R | E):
Radio = 1    Earthquake
1            1
0            0

The remaining tables are p(B = 1) = 0.01 and p(E = 1) = 0.000001. The tables and graphical structure fully specify the distribution: the joint distribution is specified by the graph and eight parameters.

Example – Part III: Inference
Initial Evidence: The alarm is sounding.
p(B = 1 | A = 1) = Σ_{E,R} p(B = 1, E, A = 1, R) / Σ_{B,E,R} p(B, E, A = 1, R)
                 = Σ_{E,R} p(A = 1 | B = 1, E) p(B = 1) p(E) p(R | E) / Σ_{B,E,R} p(A = 1 | B, E) p(B) p(E) p(R | E)
                 = 0.990002 ≈ 0.99

Explaining Away: B and E are independent causes that become dependent given A.

Additional Evidence: The radio broadcasts an earthquake warning:

A similar calculation gives p(B = 1 | A = 1, R = 1) ≈ 0.01.

Initially, because the alarm sounds, Sally thinks that she's been burgled. However, this probability drops dramatically when she hears that there has been an earthquake. The earthquake 'explains away' to an extent the fact that the alarm is ringing.
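The two posteriors above can be verified by brute-force enumeration of the four binary variables; a short Python sketch using the tables from Part II:

```python
# Brute-force check of the alarm example, using the tables from Part II.
pB = {1: 0.01, 0: 0.99}
pE = {1: 0.000001, 0: 0.999999}
pA1 = {(1, 1): 0.9999, (1, 0): 0.99, (0, 1): 0.99, (0, 0): 0.0001}  # p(A=1|B,E)
pR1 = {1: 1.0, 0: 0.0}                                              # p(R=1|E)

def joint(b, e, a, r):
    pa = pA1[(b, e)] if a == 1 else 1.0 - pA1[(b, e)]
    pr = pR1[e] if r == 1 else 1.0 - pR1[e]
    return pB[b] * pE[e] * pa * pr

def posterior_burglar(evidence):
    """p(B = 1 | evidence), marginalizing every unobserved variable."""
    num = den = 0.0
    for b in (0, 1):
        for e in (0, 1):
            for a in (0, 1):
                for r in (0, 1):
                    state = {"B": b, "E": e, "A": a, "R": r}
                    if any(state[k] != v for k, v in evidence.items()):
                        continue
                    p = joint(b, e, a, r)
                    den += p
                    if b == 1:
                        num += p
    return num / den

print(posterior_burglar({"A": 1}))            # about 0.99
print(posterior_burglar({"A": 1, "R": 1}))    # about 0.01
```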

Directed Graphs and Evidence
Incorrect Transformation: Create an undirected graph by dropping edge orientations.

B E B E

A R A R

Falsely implies that given A, B and E are independent.


Directed Graphs and Evidence
Correct Transformation: Create a factor graph linking each node to its parent(s), or an undirected moral graph linking parents with common children.

B E B E B E

A R A R A R


Directed Factor Graphs?
• A rarely-used proposal by Brendan Frey, UAI 2003
• Chain Graphs: Another hybrid of directed & undirected graphical models (see readings)
(Figures from Frey 2003: a directed factor graph is obtained by creating one function node, drawn as a solid black square, for each variable, and connecting each function node to the corresponding child and its parents; such graphs indicate conditional probabilities, and hybrid constructions model some variables in a directed and others in an undirected fashion.)

Probabilistic Graphical Models

Directed Graphical Model: Markov Properties (coming later); Factorization
Factor Graph (Hypergraph): A more detailed view of factorization, useful for algorithms
Undirected Graphical Model: Markov Properties (coming later); Factorization

CS242: Lecture 1B Outline
Ø Factor graphs
Ø Loss Functions and Marginal Distributions
Ø Variable Elimination Algorithms

Inference with Two Variables

(Three two-node models X → Y: panels (a), (b), (c).)
Inference via Table Lookup:    p(x, y) = p(x) p(y | x);   read off p(y | x = x̄)
Prediction via Marginalization:    p(y) = Σ_x p(x) p(y | x)
Inference via Bayes Rule:    p(x | y = ȳ) = p(ȳ | x) p(x) / p(ȳ)

Marginal & Conditional Distributions

Given a joint distribution p(x, y, z):
Marginalization:    p(x, y) = Σ_{z∈Z} p(x, y, z),    p(x) = Σ_{y∈Y} p(x, y)
Conditioning:    p(x, y | Z = z) = p(x, y, z) / p(z)
• Target: Subset of variables (nodes) whose values are of interest.
• Condition: Whatever other variables (nodes) we have observed.
• Marginalize: Everything that's not a target or observation. This is a finite sum, but the number of terms may be exponentially large.

Inference & Decision Theory
y ∈ Y: observed data, can take values in any space
x ∈ X: unknown or hidden variables (assume discrete for now)
A = X: the action is to estimate the hidden variables
L(x, a): table giving the loss for all possible mistakes
• The posterior expected loss of taking action a is
ρ(a | y) = E[L(x, a) | y] = Σ_{x∈X} L(x, a) p(x | y)
• The optimal Bayes decision rule is then
δ(y) = argmin_{a∈X} ρ(a | y)
• Challenge: For graphical models, the decision space is exponentially large

Decomposable Loss Functions
• Estimate hidden variables x from observations y via posterior expected loss:
ρ(a | y) = E[L(x, a) | y] = Σ_{x∈X} L(x, a) p(x | y)
• Many practical losses decompose into a term for each variable node:
L(x, a) = Σ_{i=1}^N L_i(x_i, a_i)    so that    ρ(a | y) = Σ_{i=1}^N Σ_{x_i} L_i(x_i, a_i) p(x_i | y)
Hamming Loss: L(x, a) = Σ_{i=1}^N I(x_i ≠ a_i). For how many nodes is our estimate incorrect?
Quadratic Loss: L(x, a) = Σ_{i=1}^N (x_i − a_i)^2. Expected squared distance, minimized by the posterior mean.
• For any such loss, it is sufficient to find the posterior marginals p(x_i | y) (see the sketch below).
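A minimal sketch of the two decomposable-loss decision rules above, using made-up posterior marginals: Hamming loss is minimized node-by-node by the marginal mode, quadratic loss by the posterior mean.

```python
import numpy as np

# Made-up posterior marginals p(x_i | y) for N = 3 nodes, each with K = 4
# states {0, 1, 2, 3}; each row sums to one.
marginals = np.array([[0.10, 0.20, 0.60, 0.10],
                      [0.70, 0.10, 0.10, 0.10],
                      [0.25, 0.25, 0.25, 0.25]])
states = np.arange(marginals.shape[1])

# Hamming loss is minimized node-by-node by the marginal mode.
hamming_estimate = marginals.argmax(axis=1)          # [2, 0, 0]

# Quadratic loss is minimized node-by-node by the posterior mean.
quadratic_estimate = marginals @ states              # [1.7, 0.6, 1.5]

print(hamming_estimate, quadratic_estimate)
```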

CS242: Lecture 1B Outline
Ø Factor graphs
Ø Loss Functions and Marginal Distributions
Ø Variable Elimination Algorithms

Distributive Law for Functions
Ø The classical distributive law of addition:

a·b + a·c = a·(b + c)        (figure from Wikipedia)
3 operations                 2 operations

Ø Applying the distributive law to functions of discrete variables:
f(x) = Σ_y Σ_z g(x, y) h(x, z) = [ Σ_y g(x, y) ] · [ Σ_z h(x, z) ]
Ø Assuming y and z each take on N possible values, for any fixed x:
left form: O(N^2) operations;  right form: O(N) operations
Ø If x also takes on N possible values, computing f(x) for all x has cost:
left form: O(N^3) operations;  right form: O(N^2) operations (see the check below)
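A small NumPy check of this claim, with randomly generated tables g and h standing in for the functions above:

```python
import numpy as np

N = 100
rng = np.random.default_rng(0)
g = rng.random((N, N))   # g[x, y]
h = rng.random((N, N))   # h[x, z]

# Naive double sum: for each x, sum over all (y, z) pairs -- O(N^2) per x.
f_naive = np.array([sum(g[x, y] * h[x, z] for y in range(N) for z in range(N))
                    for x in range(N)])

# Distributive law: f(x) = (sum_y g(x, y)) * (sum_z h(x, z)) -- O(N) per x.
f_fast = g.sum(axis=1) * h.sum(axis=1)

print(np.allclose(f_naive, f_fast))   # True
```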

Example: Inference in a Directed Graph
• N = 6 discrete variables, each taking one of K states
(Directed graph implied by the factorization below: x1 → x2, x1 → x3, x2 → x4, x3 → x5, and x2, x5 → x6.)

p(x) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)

Example: Inference in a Directed Graph

• N = 6 discrete variables, each taking one of K states
• Goal: Compute the marginal distribution of x1, given observations of x4 and x6

p(x) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)

p(x1 | x4 = x̄4, x6 = x̄6) ∝ Σ_{x2} Σ_{x3} Σ_{x5} p(x1) p(x2 | x1) p(x3 | x1) p(x̄4 | x2) p(x5 | x3) p(x̄6 | x2, x5)
                          = Σ_{x5} Σ_{x3} Σ_{x2} p(x1) p(x2 | x1) p(x3 | x1) p(x̄4 | x2) p(x5 | x3) p(x̄6 | x2, x5)

Number of arithmetic operations to naively compute this sum: O(N K^4). Exponential in the number of unobserved variables!

Conversion to Factor Graph Form

m4(x2) = ψ4(x̄4, x2)
m6(x2, x5) = ψ6(x̄6, x2, x5)
(Left: factor graph with x4 and x6 observed; right: reduced factor graph over x1, x2, x3, x5 after absorbing the evidence.)

p(x) = p(x1) p(x2 | x1) p(x3 | x1) p(x4 | x2) p(x5 | x3) p(x6 | x2, x5)
p(x) = ψ1(x1) ψ2(x2, x1) ψ3(x3, x1) ψ4(x4, x2) ψ5(x5, x3) ψ6(x6, x2, x5)
p(x1, x2, x3, x5 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) ψ3(x3, x1) m4(x2) ψ5(x5, x3) m6(x2, x5)

1. Convert directed graph to corresponding factor graph. Algorithm is equally valid for factor graphs derived from undirected MRFs.
2. Replace observed variables by lower-order evidence factors.

Iterative Variable Elimination I

m5(x2, x3) = Σ_{x5} ψ5(x5, x3) m6(x2, x5)        O(K^3) operations (matrix-matrix product)

p(x1, x2, x3, x5 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) ψ3(x3, x1) m4(x2) ψ5(x5, x3) m6(x2, x5)
p(x1, x2, x3 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) ψ3(x3, x1) m4(x2) Σ_{x5} ψ5(x5, x3) m6(x2, x5)
p(x1, x2, x3 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) ψ3(x3, x1) m4(x2) m5(x2, x3)

3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.

Iterative Variable Elimination II

m3(x2, x1) = Σ_{x3} ψ3(x3, x1) m5(x2, x3)        O(K^3) operations (matrix-matrix product)

p(x1, x2, x3 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) ψ3(x3, x1) m4(x2) m5(x2, x3)
p(x1, x2 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) m4(x2) Σ_{x3} ψ3(x3, x1) m5(x2, x3)
p(x1, x2 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) m4(x2) m3(x2, x1)

3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.

Iterative Variable Elimination III

m2(x1) = Σ_{x2} ψ2(x2, x1) m3(x2, x1) m4(x2)        O(K^2) operations (matrix-vector product)

p(x1, x2 | x̄4, x̄6) ∝ ψ1(x1) ψ2(x2, x1) m4(x2) m3(x2, x1)
p(x1 | x̄4, x̄6) ∝ ψ1(x1) Σ_{x2} ψ2(x2, x1) m4(x2) m3(x2, x1)
p(x1 | x̄4, x̄6) = Z1^{-1} ψ1(x1) m2(x1),    Z1 = Σ_{x1} ψ1(x1) m2(x1)

3. Pick some unobserved variable for elimination. Replace all factors which depend on that variable by a single, equivalent factor linking the eliminated variable's neighbors. Repeat recursively.
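A NumPy sketch of this worked example, with random conditional probability tables standing in for the (unspecified) model parameters; it applies the evidence factors m4 and m6 and then the message sequence m5, m3, m2 from the slides, and verifies the result against brute-force enumeration of the joint.

```python
import numpy as np

K = 3
rng = np.random.default_rng(1)

def cpt(*shape):
    """Random conditional probability table, normalized over its last axis."""
    t = rng.random(shape)
    return t / t.sum(axis=-1, keepdims=True)

# p(x1), p(x2|x1), p(x3|x1), p(x4|x2), p(x5|x3), p(x6|x2,x5)
p1, p2, p3 = cpt(K), cpt(K, K), cpt(K, K)
p4, p5, p6 = cpt(K, K), cpt(K, K), cpt(K, K, K)
x4_bar, x6_bar = 0, 2               # observed values of x4 and x6

# Step 2: replace observed variables by evidence factors.
m4 = p4[:, x4_bar]                  # m4(x2)
m6 = p6[:, :, x6_bar]               # m6(x2, x5)

# Step 3: eliminate x5, then x3, then x2 (indices: a=x1, b=x2, c=x3, e=x5).
m5 = np.einsum('ce,be->bc', p5, m6)          # m5(x2, x3), O(K^3)
m3 = np.einsum('ac,bc->ba', p3, m5)          # m3(x2, x1), O(K^3)
m2 = np.einsum('ab,ba,b->a', p2, m3, m4)     # m2(x1),     O(K^2)

posterior = p1 * m2
posterior /= posterior.sum()                 # p(x1 | x4_bar, x6_bar)

# Sanity check against brute-force enumeration of the full joint.
joint = np.einsum('a,ab,ac,bd,ce,bef->abcdef', p1, p2, p3, p4, p5, p6)
brute = joint[:, :, :, x4_bar, :, x6_bar].sum(axis=(1, 2, 3))
print(np.allclose(posterior, brute / brute.sum()))   # True
```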

A General Graph Elimination Algorithm

Algebraic Marginalization Operations
• Marginalize out the variable associated with the node being eliminated
• Compute a new potential function (factor) involving all other variables which are linked to the just-marginalized variable

Graph Manipulation Operations
• Remove, or eliminate, a single node from the graph
• Remove any factors linked to that node, and replace them by a single factor joining the neighbors of the just-removed node
• Alternative view based on undirected graphs: Add edges between all pairs of nodes neighboring the just-removed node, forming a clique

A Graph Elimination Algorithm
• Convert model to factor graph, encoding observed nodes as evidence factors
• Choose an elimination ordering (query node must be last)
• Iterate: Remove node, replace old factors by new message factor to neighbors (one elimination step is sketched in code below)
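A minimal sketch of one elimination step, operating on a list of (scope, table) factors; the representation and the tiny chain example are illustrative choices, not code from the course.

```python
from itertools import product

def eliminate(factors, var, cards):
    """Sum one variable out of a factor list.

    `factors` is a list of (scope, table) pairs, where scope is a tuple of
    variable names and table maps assignments (in scope order) to values.
    All factors touching `var` are multiplied and summed over `var`,
    producing one new factor over the remaining neighbors of `var`.
    """
    touching = [f for f in factors if var in f[0]]
    rest = [f for f in factors if var not in f[0]]
    scope = tuple(sorted({v for s, _ in touching for v in s if v != var}))
    table = {}
    for assign in product(*(range(cards[v]) for v in scope)):
        ctx = dict(zip(scope, assign))
        total = 0.0
        for val in range(cards[var]):
            ctx[var] = val
            prod = 1.0
            for s, t in touching:
                prod *= t[tuple(ctx[v] for v in s)]
            total += prod
        table[assign] = total
    return rest + [(scope, table)]

# Tiny usage: chain x1 - x2 - x3 with pairwise factors; eliminate x2, then x3.
cards = {"x1": 2, "x2": 2, "x3": 2}
f12 = (("x1", "x2"), {(a, b): 1.0 + (a == b) for a in range(2) for b in range(2)})
f23 = (("x2", "x3"), {(b, c): 1.0 + (b == c) for b in range(2) for c in range(2)})
factors = eliminate([f12, f23], "x2", cards)
factors = eliminate(factors, "x3", cards)
print(factors)   # one remaining factor over ("x1",): unnormalized p(x1)
```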

Elimination Algorithm Complexity
• Elimination cliques: Sets of neighbors of eliminated nodes
• Marginalization cost: Exponential in number of variables in each elimination clique (dominated by largest clique). Similar storage cost.
• Treewidth of graph: Over all possible elimination orderings, the smallest possible max-elimination-clique size, minus one
• NP-Hard: Finding the best elimination ordering for an arbitrary input graph (but heuristic approximation algorithms can be effective; a greedy heuristic is sketched below)
• Limitation: Lack of shared computation for finding multiple marginals
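One common heuristic is greedy min-fill: repeatedly eliminate the node whose elimination would add the fewest fill-in edges. The slides do not prescribe a particular heuristic; this is a small illustrative sketch.

```python
def min_fill_order(adj):
    """Greedy min-fill elimination ordering: repeatedly eliminate the node
    whose elimination adds the fewest new edges among its neighbors.

    `adj` maps each node to a set of neighbors; a local copy is modified.
    """
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    order = []
    while adj:
        def fill_cost(v):
            nbrs = list(adj[v])
            return sum(1 for i, a in enumerate(nbrs) for b in nbrs[i + 1:]
                       if b not in adj[a])
        v = min(adj, key=fill_cost)
        nbrs = list(adj[v])
        for i, a in enumerate(nbrs):          # connect neighbors into a clique
            for b in nbrs[i + 1:]:
                adj[a].add(b)
                adj[b].add(a)
        for a in nbrs:                        # remove the eliminated node
            adj[a].discard(v)
        del adj[v]
        order.append(v)
    return order

# Star graph (assumed from the next slide's layout): x0 linked to x1..x6.
# Min-fill eliminates leaves before the hub, so no elimination clique
# ever has more than two nodes.
star = {"x0": {"x1", "x2", "x3", "x4", "x5", "x6"},
        **{f"x{i}": {"x0"} for i in range(1, 7)}}
print(min_fill_order(star))
```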

Treewidth and Elimination Order

p(x1)=?

x1

x6 x2

x0

x5 x3

x4
Treewidth = 1

Where should x0 be placed in the elimination order?
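A small sketch that answers this numerically, assuming the figure shows a star graph with x0 linked to x1 through x6: it compares the elimination clique sizes when x0 is eliminated first versus only after the leaves (with the query node x1 last in both cases).

```python
def elimination_clique_sizes(adj, order):
    """Sizes of the elimination cliques produced by a given ordering."""
    adj = {v: set(nbrs) for v, nbrs in adj.items()}
    sizes = []
    for v in order:
        nbrs = list(adj[v])
        sizes.append(len(nbrs) + 1)              # clique = node plus neighbors
        for i, a in enumerate(nbrs):             # connect remaining neighbors
            for b in nbrs[i + 1:]:
                adj[a].add(b)
                adj[b].add(a)
        for a in nbrs:
            adj[a].discard(v)
        del adj[v]
    return sizes

# Star graph assumed from the slide layout: x0 linked to x1..x6.
star = {"x0": {"x1", "x2", "x3", "x4", "x5", "x6"},
        **{f"x{i}": {"x0"} for i in range(1, 7)}}

bad = ["x0", "x2", "x3", "x4", "x5", "x6", "x1"]      # eliminate the hub first
good = ["x2", "x3", "x4", "x5", "x6", "x0", "x1"]     # eliminate leaves first
print(max(elimination_clique_sizes(star, bad)))       # 7
print(max(elimination_clique_sizes(star, good)))      # 2, i.e. treewidth 1
```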

Treewidth and Elimination Order

p(x1)=?

x1 x2 x3 x4 x5 x6
Treewidth = 1


Does the elimination order matter for this graph?

Treewidth and Elimination Order

p(x1)=?

x1 x2 x3

x4 x5 x6

Treewidth = 2


What elimination order minimizes the maximal clique?

Efficient Elimination Orders may not Exist

For a square grid with N total nodes, the treewidth is provably O(N^0.5).