CS242: Probabilistic Graphical Models Lecture 1B: Factor Graphs, Inference, & Variable Elimination
Professor Erik Sudderth, Brown University Computer Science, September 15, 2016.
Some figures and materials courtesy David Barber, Bayesian Reasoning and Machine Learning, http://www.cs.ucl.ac.uk/staff/d.barber/brml/

CS242: Lecture 1B Outline
Ø Factor graphs
Ø Loss functions and marginal distributions
Ø Variable elimination algorithms

Factor Graphs
$\mathcal{V}$: set of $N$ nodes or vertices, $\{1, 2, \ldots, N\}$
$\mathcal{F}$: set of hyperedges linking subsets of nodes $f \subseteq \mathcal{V}$

$p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f), \qquad Z = \sum_x \prod_{f \in \mathcal{F}} \psi_f(x_f)$

$Z > 0$ is the normalization constant (partition function), and $\psi_f(x_f) \geq 0$ is an arbitrary non-negative potential function.
• In a hypergraph, the hyperedges link arbitrary subsets of nodes (not just pairs)
• Visualize by a bipartite graph, with square (usually black) nodes for hyperedges
• Motivation: factorization is key to computation
• Motivation: avoid ambiguities of undirected graphical models

Factor Graphs & Factorization
For a given undirected graph, there exist distributions with equivalent Markov properties, but different factorizations and different inference/learning complexities.
Pairwise (edge) potentials: $p(x) = \frac{1}{Z} \prod_{(s,t) \in \mathcal{E}} \psi_{st}(x_s, x_t)$
Maximal clique potentials: $p(x) = \frac{1}{Z} \prod_{c \in \mathcal{C}} \psi_c(x_c)$
An alternative factor graph factorization: $p(x) = \frac{1}{Z} \prod_{f \in \mathcal{F}} \psi_f(x_f)$
Recommendation: use undirected graphs for pairwise MRFs, and factor graphs for higher-order potentials.

Example: Nearest-Neighbor Spatial MRF
Stereo Vision: Yamaguchi et al., ECCV 2012
• Observed nodes: features of the 2D image (intensity, color, texture, ...)
• Hidden nodes: properties of the 3D world (depth, motion, object category, ...)
Table 2 of Yamaguchi et al. compares their model with the state of the art on the KITTI test set (disparity error rates in %, for pixels exceeding each error threshold; Non-Occ = non-occluded pixels only; lower is better):

                  > 2 pixels       > 3 pixels       > 4 pixels       > 5 pixels
                  Non-Occ   All    Non-Occ   All    Non-Occ   All    Non-Occ   All
GC+occ [36]        39.76   40.97    33.50   34.74    29.86   31.10    27.39   28.61
OCV-BM [37]        27.59   28.97    25.39   26.72    24.06   25.32    22.94   24.14
CostFilter [38]    25.85   27.05    19.96   21.05    17.12   18.10    15.51   16.40
GCS [26]           18.99   20.30    13.37   14.54    10.40   11.44     8.63    9.55
GCSF [39]          20.75   22.69    13.02   14.77     9.48   11.02     7.48    8.84
SDM [27]           15.29   16.65    10.98   12.19     8.81    9.87     7.44    8.39
ELAS [30]          10.95   12.82     8.24    9.95     6.72    8.22     5.64    6.94
OCV-SGBM [25]      10.58   12.20     7.64    9.13     6.04    7.40     5.04    6.25
ITGV [40]           8.86   10.20     6.31    7.40     5.06    5.97     4.26    5.01
Ours                6.25    7.78     4.13    5.45     3.18    4.32     2.66    3.66

Fig. 4 of Yamaguchi et al. shows KITTI examples: (left) input image, (middle) estimated disparity, (right) disparity errors.

Example: Object Segmentation
Diverse M-Best Solutions in Markov Random Fields: Batra et al., ECCV 2012
• Observed nodes: features of the 2D image (intensity, color, texture, ...)
• Hidden nodes: properties of the 3D world (depth, motion, object category, ...)
Fig. 1 of Batra et al. shows category-level segmentation results on test images from VOC 2010; for each image, the "mode" shown is the best of 10 solutions obtained with DivMBest. Fig. 2 shows (left) an interactive segmentation setup and (right) pose tracking.
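To make the factorization concrete, here is a minimal sketch (not from the lecture) that evaluates $p(x) = \frac{1}{Z} \prod_f \psi_f(x_f)$ for a made-up three-variable chain by exhaustive enumeration; the factor scopes and potential tables are illustrative assumptions. The exponential cost of summing over all joint configurations is exactly what the variable elimination algorithms later in this lecture avoid.

```python
# A minimal sketch (not from the slides) of the factor-graph distribution
# p(x) = (1/Z) * prod_f psi_f(x_f), evaluated by brute-force enumeration.
# The toy chain x0 - x1 - x2 and its potential tables are made up for illustration.
import itertools
import numpy as np

num_states = 2                      # each variable x_s takes values in {0, 1}
variables = [0, 1, 2]               # node set V = {0, 1, 2}

# Each factor is (list of variable indices it touches, potential table over them).
factors = [
    ([0, 1], np.array([[1.0, 0.5], [0.5, 2.0]])),   # psi_{01}(x_0, x_1)
    ([1, 2], np.array([[1.0, 0.2], [0.2, 1.0]])),   # psi_{12}(x_1, x_2)
    ([2],    np.array([0.7, 0.3])),                 # unary evidence factor on x_2
]

def unnormalized(x):
    """Product of all factor potentials evaluated at the joint configuration x."""
    p = 1.0
    for scope, table in factors:
        p *= table[tuple(x[s] for s in scope)]
    return p

# Partition function Z = sum_x prod_f psi_f(x_f), by brute force over 2^3 states.
all_configs = list(itertools.product(range(num_states), repeat=len(variables)))
Z = sum(unnormalized(x) for x in all_configs)

# Marginal p(x_0 = 1): sum the normalized joint over every other variable.
p_x0_is_1 = sum(unnormalized(x) for x in all_configs if x[0] == 1) / Z
print(f"Z = {Z:.4f}, p(x0 = 1) = {p_x0_is_1:.4f}")
```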
"Mode" above refers to a solution obtained with DivMBest. In contrast to the M-best MAP problem, which finds the top M most probable solutions under a probabilistic model, DivMBest emphasizes diversity: it seeks highly probable solutions that are qualitatively different from the MAP and from each other. Even if the MAP solution alone is of poor quality, a diverse set of highly probable hypotheses may still enable accurate predictions; in a problem with an enormous state space, any reasonable setting of M (10-50) for literal M-best MAP would return solutions nearly identical to the MAP.

The Yamaguchi et al. stereo model also illustrates richer factor types: compatibility potentials that penalize occlusion boundaries unsupported by the disparity data, boundary potentials that prefer simpler explanations (coplanar over hinge over occlusion, encoded via $\lambda_{occ} > \lambda_{hinge} > 0$), and higher-order junction-feasibility potentials following prior work on occlusion boundary reasoning.

Higher-Order MRF Potentials
Higher-order models in computer vision: Kohli & Rother, 2012
• Observed nodes: features of the 2D image (intensity, color, texture, ...)
• Hidden nodes: properties of the 3D world (depth, motion, object category, ...)
Figure 1.2 of Kohli & Rother shows qualitative segmentation results: the original image; the unary likelihood labeling from TextonBoost [34]; the result of a pairwise contrast-preserving smoothness potential [34]; the result of the P^n Potts model potential [15]; the result of the Robust P^n model potential (1.7) with truncation parameter Q = 0.1|c|, where |c| is the size of the super-pixel over which the higher-order potential is defined; and hand-labeled ground-truth segmentations (imperfect, with many unlabelled pixels marked black). The Robust P^n model gives the best results.
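The "truncated linear" behavior of the Robust P^n potential can be sketched in a few lines; the function below is an illustrative reading of that description, and the name gamma_max and the exact slope are assumptions rather than the parameterization from the slides (see Kohli & Rother, 2012, for the precise model).

```python
# A sketch (not from the lecture) of a Robust P^n Potts-style higher-order potential:
# the clique cost grows linearly with the number of variables disagreeing with the
# dominant label, and is truncated at gamma_max once more than Q variables disagree.
# gamma_max, Q, and this exact form are illustrative assumptions.
from collections import Counter

def robust_pn_cost(clique_labels, Q, gamma_max):
    """Truncated-linear higher-order cost over one clique (e.g., a super-pixel)."""
    counts = Counter(clique_labels)
    dominant_count = max(counts.values())
    num_inconsistent = len(clique_labels) - dominant_count
    # Linear ramp up to Q inconsistent variables, then a constant maximum cost.
    return min(gamma_max, num_inconsistent * gamma_max / Q)

# Example: a super-pixel of 10 pixels where 2 pixels disagree with the dominant label,
# using the truncation Q = 0.1|c| mentioned in the figure caption.
labels = ["sheep"] * 8 + ["grass"] * 2
print(robust_pn_cost(labels, Q=0.1 * len(labels), gamma_max=1.0))  # 1.0 (already truncated)
```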
For instance, the legs of the sheep and bird are accurately labeled, which is missing in the other results. Unlike the standard P^n Potts model, the Robust P^n potential gives rise to a cost that is a truncated linear function of the number of inconsistent variables (see their figure 1.1), so a few variables in a clique may disagree with the dominant label at bounded cost.

Low-Density Parity Check (LDPC) Codes
Parity check factors:
Ø Each variable node is binary, so $x_s \in \{0, 1\}$.
Ø Parity check factors equal 1 if the sum of the connected bits is even, and 0 if the sum is odd, so invalid codewords are excluded.
Evidence (observation) factors:
Ø Unary evidence factors equal the probability that each bit is a 0 or 1, given the data. This assumes independent "noise" on each bit.
The factor graph connects the data bits and parity bits to the parity check factors.

Inference (BP iterations 0, 5, 15):
Figure 3: Belief propagation decoding of a (3,6) regular LDPC code, given a binary input message (top half) corrupted by noise with error probability $\epsilon = 0.08$ (as set up in [5]). We show the received codeword including parity bits (zero iterations, left), the bits that maximize the posterior marginals after five and fifteen BP iterations, and the final, correctly decoded signal after convergence to a loopy BP fixed point (the Stata Center, home of the MIT Computer Science & Artificial Intelligence Laboratory).

This example uses systematic message encoding [5], in which the first half of the transmitted bits are the uncoded message of interest. To aid visualization, we choose this message to be a simple binary image, but the LDPC code does not model or exploit this spatial structure. To decode a noise-corrupted message via loopy BP, we pass messages in a parallel fashion. Each iteration begins by sending new messages from each variable node to all neighboring factor nodes, and then updates the outgoing messages from each parity check. As illustrated in Fig. 3, this allows information from unviolated parity checks to propagate throughout the message, gradually correcting noisy bits. These BP message updates continue until the thresholded marginal probabilities define a valid codeword, or until some maximum number of iterations is reached. In practice, BP decoding typically fails due to oscillations in the message update dynamics, rather than convergence to an incorrect codeword [5]. The outstanding empirical performance of message-passing decoders has inspired extensive research on alternatives to regular LDPC codes [5]. One widely used trick removes all cycles of length four from the factor graph, so that no two parity checks share a common pair of message bits.
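The decoding loop described above (parallel message updates, stopping when the thresholded marginals form a valid codeword or an iteration limit is reached) can be sketched as a generic log-likelihood-ratio sum-product decoder. The function name bp_decode, the binary-symmetric-channel likelihoods, and the tiny hand-made parity-check matrix are assumptions for illustration, not the (3,6) regular code from the figure.

```python
# A compact sketch (not the lecture's code) of loopy BP decoding for a binary LDPC
# code, in the log-likelihood-ratio (LLR) domain with a parallel "flooding" schedule:
# variable-to-check messages are sent (initialized from the channel), then each
# parity check updates its outgoing messages, until a valid codeword is found.
import numpy as np

def bp_decode(H, received, eps, max_iters=50):
    m, n = H.shape
    # Channel LLRs for a binary symmetric channel: positive values favor bit = 0.
    llr_ch = np.where(received == 0, 1.0, -1.0) * np.log((1 - eps) / eps)
    msg_vc = np.tile(llr_ch, (m, 1)) * H          # variable -> check messages
    msg_cv = np.zeros((m, n))                     # check -> variable messages
    for _ in range(max_iters):
        # Check-to-variable update: tanh rule over the other neighbors of each check.
        for c in range(m):
            nbrs = np.flatnonzero(H[c])
            t = np.tanh(msg_vc[c, nbrs] / 2.0)
            for i, v in enumerate(nbrs):
                prod_others = np.prod(np.delete(t, i))
                msg_cv[c, v] = 2.0 * np.arctanh(np.clip(prod_others, -0.999999, 0.999999))
        # Posterior LLRs, tentative hard decision, and codeword validity test.
        posterior = llr_ch + msg_cv.sum(axis=0)
        decoded = (posterior < 0).astype(int)
        if not np.any(H @ decoded % 2):           # all parity checks satisfied
            return decoded
        # Variable-to-check update: channel LLR plus messages from the other checks.
        for v in range(n):
            for c in np.flatnonzero(H[:, v]):
                msg_vc[c, v] = llr_ch[v] + msg_cv[:, v].sum() - msg_cv[c, v]
    return decoded                                 # maximum iterations reached

# Toy example with a small hand-made parity-check matrix (an assumption).
H = np.array([[1, 1, 0, 1, 0, 0],
              [0, 1, 1, 0, 1, 0],
              [1, 0, 1, 0, 0, 1]])
received = np.array([1, 0, 1, 1, 1, 0])           # noisy word; eps = 0.08 as in Fig. 3
print(bp_decode(H, received, eps=0.08))
```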