Beyond Junction Tree: High Tree-Width Models

Tony Jebara Columbia University

February 25, 2015. Joint work with A. Weller, N. Ruozzi and K. Tang.

Data as graphs

Want to perform inference on large networks... Junction tree algorithm becomes inefficient...

Figure: Social network

Outline

Goals: perform inference on large networks
Approach: set up tasks as finding maxima and marginals of a probability distribution p(x1, ..., xn)
Limitation: for cyclic p(x1, ..., xn) these are intractable
Methodology: graphical modeling and efficient solvers
Verification: perfect graph theory and bounds

Graphical models

We depict a graphical model G as a bipartite factor graph with round variable vertices X = {x1, ..., xn} and square factor vertices {ψ1, ..., ψl}. Assume the xi are discrete variables. This represents

p(x1, ..., xn) = (1/Z) exp( Σ_{c∈W} ψc(Xc) )

where the Xc are the variables that neighbor factor c.

Figure: factor graph on x1, ..., x6 with p(X) = (1/Z) e^{ψ1,2(x1,x2)} e^{ψ2,3(x2,x3)} e^{ψ3,4,5(x3,x4,x5)} e^{ψ4,5,6(x4,x5,x6)}

Graphical models

Figure: the same factor graph on x1, ..., x6

Use marginal or maximum a posteriori (MAP) inference

Marginal inference: p(xi) = Σ_{X\xi} p(X)
MAP inference: X* where p(X*) ≥ p(X) for all X

In general:
Both are NP-hard [Cooper 1990, Shimony 1994]
Both are hard to approximate [Dagum 1993, Abdelbar 1998]
On acyclic graphical models both are easy [Pearl 1988]
But most models (e.g. Medical Diagnostics) are not acyclic

Belief propagation for tree inference
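Both tasks can be written out by brute-force enumeration, which is exponential in n and motivates the efficient propagation schemes that follow. A toy sketch on a 3-variable binary chain (the potential values are made up):

```python
import itertools, math

# Hypothetical log-potentials for a 3-variable binary chain model.
psi12 = {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 1.5}
psi23 = {(0, 0): 0.8, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 1.2}

def score(x):
    # Unnormalized log-probability: sum of clique potentials.
    return psi12[(x[0], x[1])] + psi23[(x[1], x[2])]

states = list(itertools.product([0, 1], repeat=3))
Z = sum(math.exp(score(x)) for x in states)

# Marginal inference: sum out everything except x2.
p_x2_is_1 = sum(math.exp(score(x)) for x in states if x[1] == 1) / Z

# MAP inference: the single most probable joint configuration.
x_map = max(states, key=score)
```

Enumeration visits 2^n states, so for large networks this is exactly the cost belief propagation avoids on trees.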

Acyclic models are efficiently solvable by belief propagation.
Marginal inference via sum-product:

Send messages from variable v to factor u:
µ_{v→u}(xv) = Π_{u* ∈ N(v)\{u}} µ_{u*→v}(xv)

Send messages from factor u to variable v:
µ_{u→v}(xv) = Σ_{X′u : x′v = xv} e^{ψu(X′u)} Π_{v* ∈ N(u)\{v}} µ_{v*→u}(x′v*)
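On a chain, the two message updates reduce to simple sums into each variable. A minimal sketch (hypothetical potentials), checked against brute-force enumeration:

```python
import itertools, math

# Hypothetical pairwise factors on a binary chain x1 - x2 - x3.
psi = [
    {(0, 0): 1.0, (0, 1): 0.2, (1, 0): 0.2, (1, 1): 1.5},  # factor on (x1, x2)
    {(0, 0): 0.8, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 1.2},  # factor on (x2, x3)
]

def sum_product_marginal_x2():
    # Leaf variables send uniform messages, so the factor-to-variable
    # messages into x2 are plain sums over the eliminated neighbor.
    m_left = [sum(math.exp(psi[0][(x1, x2)]) for x1 in (0, 1)) for x2 in (0, 1)]
    m_right = [sum(math.exp(psi[1][(x2, x3)]) for x3 in (0, 1)) for x2 in (0, 1)]
    belief = [m_left[v] * m_right[v] for v in (0, 1)]
    z = sum(belief)
    return [b / z for b in belief]

def brute_marginal_x2():
    states = list(itertools.product([0, 1], repeat=3))
    w = [math.exp(psi[0][(x[0], x[1])] + psi[1][(x[1], x[2])]) for x in states]
    z = sum(w)
    return [sum(wi for wi, x in zip(w, states) if x[1] == v) / z for v in (0, 1)]
```

On this tree the two answers agree exactly, as the theory guarantees.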

Beliefs efficiently converge to p(Xu) ∝ e^{ψu(Xu)} Π_{v∈N(u)} µ_{v→u}(xv)
MAP inference via max-product: swap Σ_{X′u} with max_{X′u}

How to handle cyclic (loopy) graphical models?

To make loopy models non-loopy, we triangulate into a junction tree. This can make big cliques... and messages are exponential in the size of the largest clique.
Tree-width of a graph: size of the largest clique after triangulation

Figure: Triangulating the cyclic model p(X) ∝ φ12 φ23 φ34 φ45 φ51 makes a less efficient acyclic model p(X) ∝ φ145 φ124 φ234.

So... what if we skip triangulation? JTA messages may not converge and may give wrong answers.

Loopy sum-product belief propagation

Figure: Alarm Network and Results

Loopy sum-product belief propagation

Figure: Medical Diagnostics Network and Results

Loopy max-product belief propagation

Figure: Bipartite Matching Network on x1, ..., x4 and x̃1, ..., x̃4, and Results: scatter of Loopy Max Product vs. Correct Max Marginals over [0, 1].

Bipartite matching

Profit matrix W (rows: "laptop", "server", "phone"; columns: Motorola, Apple, Dell):

         Motorola  Apple  Dell
laptop      0$      2$     2$
server      0$      2$     3$
phone       2$      3$     0$

maps to the 1-matching

    C = [ 0 1 0
          0 0 1
          1 0 0 ]

Given W, max_{C ∈ B^{n×n}} Σ_ij Wij Cij such that Σ_i Cij = Σ_j Cij = 1
Can be written as a very loopy graphical model
But... max-product finds the MAP solution in O(n^3) [HJ 2007]

Bipartite b-matching
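The n = 3 example above is small enough to check by enumerating all permutation matrices (profits taken from the slide):

```python
import itertools

# Profit matrix from the slide: rows laptop/server/phone,
# columns Motorola/Apple/Dell.
W = [[0, 2, 2],
     [0, 2, 3],
     [2, 3, 0]]

# A 1-matching C is a permutation: row i is matched to column perm[i].
best = max(itertools.permutations(range(3)),
           key=lambda perm: sum(W[i][perm[i]] for i in range(3)))
best_value = sum(W[i][best[i]] for i in range(3))
```

The winner matches the slide's C: laptop to Apple, server to Dell, phone to Motorola, total profit 7$. Max-product reaches the same answer without enumerating all n! permutations.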

The same profit matrix W now maps to the 2-matching

    C = [ 0 1 1
          1 0 1
          1 1 0 ]

Given W, max_{C ∈ B^{n×n}} Σ_ij Wij Cij such that Σ_i Cij = Σ_j Cij = b
Also creates a very loopy graphical model
Max-product also finds the MAP solution in O(n^3) [HJ 2007]

Bipartite generalized matching

Figure: bipartite graph with nodes u1, ..., u4 (top) and v1, ..., v4 (bottom)

Graph G = (U, V, E) with U = {u1, ..., un} and V = {v1, ..., vn}, and M(·), a set of neighbors of node ui or vj
Define xi ∈ X and yj ∈ Y where xi = M(ui) and yj = M(vj)
Then p(X, Y) = (1/Z) Π_i Π_j ϕ(xi, yj) Π_k φ(xk) φ(yk)
where φ(yj) = exp( Σ_{ui∈yj} Wij ) and ϕ(xi, yj) = ¬(vj ∈ xi ⊕ ui ∈ yj)

So... why does loopy max-product work for matching?

Theorem (HJ 2007) Max product finds generalized bipartite matching MAP in O(n3).

Proof. Using the unwrapped tree T of depth Ω(n), we show that maximizing the belief at the root of T is equivalent to maximizing the belief at the corresponding node in the original graphical model.

Figure: unwrapped tree rooted at u1, branching to v1, ..., v4 and then to copies of u2, u3, u4

So some loopy graphical models are tractable...

Generalized matching

Figure: BP and GOBLIN median running times as functions of n and b (left: B = 5, right: B = ⌊n/2⌋).

Empirically, max product belief propagation needs O(|E|) messages
Code at http://www.cs.columbia.edu/~jebara/code

Generalized matching

Applications:
alternative to k-nearest neighbors [JWC 2009]
clustering [JS 2006]
classification [HJ 2007]
collaborative filtering [HJ 2008]
semisupervised learning [JWC 2009]
visualization [SJ 2009]
metric learning [SHJ 2012]
privacy-preservation [CJT 2013]

Generalized matching vs. k-nearest neighbors

Figure: k-nearest neighbors with k = 2 (a.k.a. kissing number)

Generalized matching vs. k-nearest neighbors

Figure: b-matching with b = 2

Generalized matching for link prediction

Linking websites according to traffic similarity. Left is k-nearest neighbors, right is b-matching.

What is a perfect graph?

Figure: In 1960, Berge introduced perfect graphs: G is perfect iff, for all induced subgraphs H, the coloring number of H equals the clique number of H. Berge stated the Strong Perfect Graph Conjecture, open for 50 years.

Many NP-hard problems become polynomial time for perfect graphs [Grötschel, Lovász, Schrijver 1984]:
Maximum clique
Maximum stable set

Efficient problems on perfect graphs

Figure: Coloring, MaxClique, MaxStableSet

Coloring: color nodes with fewest colors such that no adjacent nodes have the same color
Max Clique: largest set of nodes, all pairwise adjacent
Max Stable Set: largest set of nodes, none pairwise adjacent

Efficient problems on weighted perfect graphs

Figure: a weighted graph shown with a stable set, a MWSS and a MMWSS

Stable set: no two vertices adjacent
Max Weight Stable Set (MWSS): stable set with max weight
Maximal MWSS (MMWSS): MWSS with max cardinality (includes as many 0-weight nodes as possible)
MWSS solvable in polynomial time via linear programming, semidefinite programming or message passing (Õ(n^5) and faster)

MWSS via linear programming

max_{x ∈ R^n, x ≥ 0} f⊤x  s.t.  Ax ≤ 1

For the example graph on x1, ..., x6:

A = [ 1 1 0 0 0 0
      0 1 1 0 0 0
      0 0 1 1 1 0
      0 0 0 1 1 1 ]

A ∈ R^{m×n} is the vertex versus maximal-cliques incidence matrix
f ∈ R^n is the vector of weights
For perfect graphs, the LP is binary and finds the MWSS in O(√m n^3)
Note m is the number of cliques in the graph (may be exponential)

MWSS via message-passing
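For intuition, the 6-node example above can be solved exhaustively; the LP (or the message passing that follows) is the scalable route. The node weights here are made up:

```python
import itertools

# Edges of the 6-node example whose maximal cliques are the rows of A:
# {x1,x2}, {x2,x3}, {x3,x4,x5}, {x4,x5,x6}  (0-indexed below).
cliques = [{0, 1}, {1, 2}, {2, 3, 4}, {3, 4, 5}]
edges = {frozenset(e) for c in cliques for e in itertools.combinations(c, 2)}

f = [8, 2, 3, 4, 0, 4]  # hypothetical node weights

def is_stable(s):
    # A stable set contains no edge of the graph.
    return all(frozenset(p) not in edges for p in itertools.combinations(s, 2))

# Exhaustive MWSS over all 2^6 subsets.
mwss = max((s for r in range(7) for s in itertools.combinations(range(6), r)
            if is_stable(s)), key=lambda s: sum(f[i] for i in s))
mwss_weight = sum(f[i] for i in mwss)
```

With these weights the winner is {x1, x3, x6} with total weight 15; the clique-constraint LP above returns the same set as an integral vertex because the graph is perfect.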

Input: G = (V, E), cliques C = {c1, ..., cm} and weights fi for i ∈ V
Initialize zj = max_{i∈cj} fi / Σ_{c∈C} [i ∈ c]  for j ∈ {1, ..., m}
Until converged do
    Randomly choose a ≠ b ∈ {1, ..., m}
    Compute hi = max( 0, fi − Σ_{j: i∈cj, j≠a,b} zj )  for i ∈ ca ∪ cb
    Compute sa = max_{i∈ca\cb} hi
    Compute sb = max_{i∈cb\ca} hi
    Compute sab = max_{i∈ca∩cb} hi
    Update za = max( sa, (sa − sb + sab)/2 )
    Update zb = max( sb, (sb − sa + sab)/2 )
Output: z* = [z1, ..., zm]⊤

MWSS via semi-definite programming

ϑ = max_{M ⪰ 0} Σ_ij √(fi fj) Mij  s.t.  Mii = 1, Mij = 0 ∀(i, j) ∈ E

This is known as the Lovász theta-function
Let M ∈ R^{n×n} be the maximizer of ϑ(G)
Let ϑ be the recovered total weight of the MWSS
Under mild assumptions, get x* = round(ϑ M 1)
For perfect graphs, find the MWSS in Õ(n^5)

Perfect graph theory

Theorem (Strong Perfect Graph Theorem, Chudnovsky et al. 2006)
G is perfect ⇔ G contains no odd hole or odd antihole

Hole: an induced subgraph which is a (chordless) cycle of length at least 4. An odd hole has odd cycle length.
Antihole: the complement of a hole
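Berge's definition can be tested directly on small graphs by comparing coloring and clique numbers over all induced subgraphs. A brute-force sketch confirming that an odd hole (the 5-cycle) is not perfect while even cycles are:

```python
import itertools

def omega(nodes, adj):
    # Clique number: size of the largest all-pairwise-adjacent subset.
    return max(r for r in range(len(nodes) + 1)
               for s in itertools.combinations(nodes, r)
               if all(b in adj[a] for a, b in itertools.combinations(s, 2)))

def chi(nodes, adj):
    # Coloring number: fewest colors with no adjacent pair sharing a color.
    nodes = list(nodes)
    if not nodes:
        return 0
    for k in range(1, len(nodes) + 1):
        for colors in itertools.product(range(k), repeat=len(nodes)):
            col = dict(zip(nodes, colors))
            if all(col[a] != col[b] for a in nodes for b in adj[a]):
                return k

def is_perfect(n, edge_list):
    adj = {v: set() for v in range(n)}
    for a, b in edge_list:
        adj[a].add(b)
        adj[b].add(a)
    # Berge: chi(H) must equal omega(H) for every induced subgraph H.
    for r in range(1, n + 1):
        for sub in itertools.combinations(range(n), r):
            sub_adj = {v: adj[v] & set(sub) for v in sub}
            if chi(sub, sub_adj) != omega(sub, sub_adj):
                return False
    return True

def cycle(n):
    return [(i, (i + 1) % n) for i in range(n)]
```

The 5-cycle fails on itself (χ = 3 but ω = 2), which is exactly the odd-hole obstruction of the theorem above.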

Figure: three example graphs: perfect, not perfect, perfect

Other perfect graph theorems

Lemma (Replication, Lovász 1972)
Let G be a perfect graph and let v ∈ V(G). Define a graph G′ by adding a new vertex v′ and joining it to v and to all the neighbors of v. Then G′ is perfect.

Figure: replication example on a graph with vertices x1, ..., x6, adding a new vertex x7

Other perfect graph theorems

Lemma (Pasting on a Clique, Gallai 1962)
Let G be a perfect graph and let G′ be a perfect graph. If G ∩ G′ is a clique (clique cutset), then G ∪ G′ is a perfect graph.

Figure: pasting two graphs that intersect in the clique {x3, x4} yields a perfect graph on x0, ..., x9

Our plan: reduce NP-hard inference to MWSS

Reduce MAP to MWSS on a weighted graph
If the reduction produces a perfect graph, inference is efficient
Proves efficiency of MAP on
    Acyclic models
    Bipartite matching models
    Attractive models
    Slightly frustrated models (new)
Reduce Bethe marginal inference to MWSS on a weighted graph
Proves efficiency of Bethe marginals on
    Acyclic models
    Attractive models (new)
    Frustrated models (new)

Reduction: graphical model M → NMRF N

Given a graphical model M, construct a nand Markov random field (NMRF) N:
Weighted graph N(VN, EN, w) with vertices VN, edges EN and weight function w: VN → R≥0
Each c ∈ C from M maps to a clique group of N with one node for each configuration xc, all pairwise adjacent
Nodes are adjacent iff they have inconsistent settings for any variable Xi
Weights of each node in N set as ψc(xc) − min_{xc} ψc(xc)

Figure: MRF M with binary variables x1, x2, x3 (left) and NMRF N with one node per clique configuration (right).

Reduction: graphical model M → NMRF N

MAP inference: identify x* = arg max_x Σ_{c∈C} ψc(xc)

Lemma (J 2009)
A MMWSS of the NMRF finds a MAP solution

Proof.

Sketch: MAP selects, for each ψc, one configuration of xc which must be globally consistent with all other choices, so as to maximize the total weight. This is exactly what MMWSS does.

Reparameterization and pruning
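The reduction and the lemma can be exercised end-to-end on a toy chain MRF (made-up potentials, with brute-force MWSS standing in for a polynomial-time perfect-graph solver): the maximum stable-set weight should equal the MAP value minus the per-clique offsets.

```python
import itertools

# Hypothetical chain MRF on binary x1, x2, x3 with cliques {1,2} and {2,3}.
cliques = [(0, 1), (1, 2)]
psi = [
    {(0, 0): 5.0, (0, 1): 1.0, (1, 0): 2.0, (1, 1): 6.0},
    {(0, 0): 7.0, (0, 1): 1.0, (1, 0): 3.0, (1, 1): 8.0},
]

# NMRF: one node per (clique, configuration), weight psi - min psi.
nodes = [(c, cfg) for c in range(len(cliques)) for cfg in psi[c]]
w = {(c, cfg): psi[c][cfg] - min(psi[c].values()) for c, cfg in nodes}

def adjacent(a, b):
    (ca, xa), (cb, xb) = a, b
    if ca == cb:
        return True  # same clique group: all pairwise adjacent
    shared = set(cliques[ca]) & set(cliques[cb])
    # Adjacent iff any shared variable is set inconsistently.
    return any(xa[cliques[ca].index(v)] != xb[cliques[cb].index(v)]
               for v in shared)

def mwss_weight():
    best = 0.0
    for r in range(len(nodes) + 1):
        for s in itertools.combinations(nodes, r):
            if all(not adjacent(a, b) for a, b in itertools.combinations(s, 2)):
                best = max(best, sum(w[v] for v in s))
    return best

map_value = max(sum(psi[c][tuple(x[v] for v in cliques[c])]
                    for c in range(len(cliques)))
                for x in itertools.product([0, 1], repeat=3))
offset = sum(min(p.values()) for p in psi)
```

The stable set picks one mutually consistent configuration node per clique group, which is exactly the MMWSS-equals-MAP argument of the lemma.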

Lemma (WJ 2013)
To find a MMWSS, it is sufficient to prune any 0-weight nodes, solve MWSS on the remaining graph, then greedily reintroduce 0-weight nodes while maintaining stability.

A reparameterization is a transformation

{ψc} → {ψ′c}  s.t.  ∀x, Σ_{c∈C} ψc(xc) = Σ_{c∈C} ψ′c(xc) + constant

It does not change the MAP solution but can simplify the NMRF

Lemma (WJ 2013)
MAP inference is tractable provided ∃ an efficient reparameterization s.t. we obtain a perfect pruned NMRF

Reparameterization and pruning

Figure: Graphical model's ψ values and final weights in the pruned NMRF

NMRF for tree models is perfect

Figure: Reducing a tree model on x1, ..., x4 (a) to its NMRF (b)

NMRF for tree models is perfect

Theorem (J 2009)
Let G be a tree; the NMRF obtained from G is a perfect graph.

Proof.
First prove perfection for a star graph with internal node v having |v| configurations. First obtain the NMRF for the star graph by creating only one configuration for each non-internal node. The resulting graph is a complete |v|-partite graph, which is perfect. Introduce additional configurations for the non-internal nodes one at a time using the replication lemma; the resulting star NMRF is perfect. Obtain a tree by induction: add two stars Gstar and Gstar′. The intersection is a fully connected clique (clique cutset), so by [Gallai 1962] the resulting graph is perfect. Continue gluing stars to form the full tree G.

NMRF for matching models is perfect

Figure: Reducing a matching model on x1, x2, x3 and x̃1, x̃2, x̃3 (a) to its pruned NMRF with nodes vij (b)

NMRF for attractive models is perfect

Figure: Reducing an attractive binary pairwise model on x1, ..., x4 (a) to its pruned NMRF (b)

Attractive edges (solid red) have potential functions which satisfy ψc(0,0) + ψc(1,1) ≥ ψc(0,1) + ψc(1,0)
In fact, since this makes a bipartite graph, we can use an ultra-fast max-flow linear programming solver for MWSS

NMRF for attractive models is perfect

Image segmentation via Kolmogorov's Graph-Cuts code

p(x) = (1/Z) Π_{ij∈E(G)} exp(ψ(xi, xj)) Π_{i∈V(G)} exp(ψi(xi))

Here, all ψ(xi, xj) = [α β; β α] where α > β
Each ψi(xi) = [(1 − zi) (zi)] where zi is the grayscale of pixel i

Signed graphical models
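Kolmogorov's code is an optimized C++ library; the reduction it exploits, attractive binary MAP to minimum s-t cut, can be sketched in plain Python. Here theta and lam = α − β are made-up parameters, and Edmonds-Karp stands in for a fast max-flow routine:

```python
from collections import deque

def max_flow(cap, s, t):
    # Edmonds-Karp on a dict-of-dicts capacity map; returns the set of
    # nodes still reachable from s in the final residual graph.
    while True:
        parent = {s: None}
        q = deque([s])
        while q and t not in parent:
            u = q.popleft()
            for v, c in cap[u].items():
                if c > 1e-12 and v not in parent:
                    parent[v] = u
                    q.append(v)
        if t not in parent:
            return set(parent)
        path, v = [], t
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        aug = min(cap[u][v] for u, v in path)
        for u, v in path:
            cap[u][v] -= aug
            cap[v][u] = cap[v].get(u, 0.0) + aug

def attractive_map(theta, edges, lam):
    # MAP for p(x) ∝ exp(sum_i theta_i x_i - lam * #disagreeing edges):
    # an attractive binary pairwise model reduced to min s-t cut.
    n = len(theta)
    s, t = 's', 't'
    cap = {s: {}, t: {}}
    for i in range(n):
        cap[i] = {}
    for i, th in enumerate(theta):
        cap[s][i] = max(0.0, -th)   # cost paid if x_i = 1
        cap[i][t] = max(0.0, th)    # cost paid if x_i = 0
    for i, j in edges:
        cap[i][j] = cap[i].get(j, 0.0) + lam  # paid if x_i != x_j
        cap[j][i] = cap[j].get(i, 0.0) + lam
    source_side = max_flow(cap, s, t)
    return [0 if i in source_side else 1 for i in range(n)]
```

The source side of the minimum cut gets label 0 and the sink side label 1; attractiveness (lam ≥ 0) is what makes all capacities nonnegative and the cut exact.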

More generally, a binary model can have edges with either attractive or repulsive signs

Figure: A signed graph on x1, ..., x6; solid (dashed) edges are attractive (repulsive)

Attractive edges (red): ψ(0,0) + ψ(1,1) ≥ ψ(0,1) + ψ(1,0)
Repulsive edges (blue): ψ(0,0) + ψ(1,1) ≤ ψ(0,1) + ψ(1,0)

Which signed models give perfect NMRFs?

Definition A frustrated cycle contains an odd number of repulsive edges.

Consider the cycles in the graphical model:

Non-frustrated cycle: what we call a BR structure, no odd holes
Frustrated cycle with > 3 edges: creates odd holes
Frustrated cycle with exactly 3 edges:
1 repulsive edge: to avoid odd holes must have a Un structure
3 repulsive edges: to avoid odd holes must have a Tm,n structure

Theorem (WJ 2013)

A graphical model maps to a perfect pruned NMRF for all valid ψc iff it decomposes into blocks of the form BR, Tm,n or Un.

Example of a BR structure
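Whether a signed model contains a frustrated cycle (equivalently, whether it is a candidate BR structure) can be decided in linear time by a parity 2-coloring: attractive edges must join equal sides and repulsive edges opposite sides. A small sketch with made-up edge lists:

```python
from collections import deque

def has_frustrated_cycle(n, signed_edges):
    # signed_edges: (i, j, +1) attractive, (i, j, -1) repulsive.
    # The model is frustration-free iff this 2-coloring succeeds
    # on every connected component.
    adj = {i: [] for i in range(n)}
    for i, j, s in signed_edges:
        adj[i].append((j, s))
        adj[j].append((i, s))
    side = {}
    for start in range(n):
        if start in side:
            continue
        side[start] = 0
        q = deque([start])
        while q:
            u = q.popleft()
            for v, s in adj[u]:
                want = side[u] if s > 0 else 1 - side[u]
                if v not in side:
                    side[v] = want
                    q.append(v)
                elif side[v] != want:
                    return True  # some cycle has an odd number of repulsive edges
    return False
```

A triangle with one repulsive edge is frustrated, while a triangle with two repulsive edges is not, matching the cycle classification above.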


Figure: A BR structure is 2-connected and contains no frustrated cycle. Solid (dashed) edges are attractive (repulsive). Deleting any edges maintains the BR property.

Examples of Tm,n and Un structures

Figure: A Tm,n structure with m = 2 and n = 3 (nodes s, t, r1, r2, a1, a2, a3). Note the triangle with 3 repulsive edges. Solid (dashed) edges are attractive (repulsive).

Figure: A Un structure with n = 5 (nodes s, t, v1, ..., v5). Note the triangle with 1 repulsive edge. Solid (dashed) edges are attractive (repulsive).

NMRF for slightly frustrated models is perfect

Figure: Binary pairwise graphical model, provably tractable with perfect pruned NMRF due to decomposition into BR, Tm,n and Un structures.

Our plan: reduce NP-hard inference to MWSS

Reduce MAP to MWSS on a weighted graph
If the reduction produces a perfect graph, inference is efficient
Proves efficiency of MAP on
    Acyclic models
    Bipartite matching models
    Attractive models
    Slightly frustrated models (new)
Reduce Bethe marginal inference to MWSS on a weighted graph
Proves efficiency of Bethe marginals on
    Acyclic models
    Attractive models (new)
    Frustrated models (new)

Reduce marginal inference p(xi) = Σ_{X\xi} p(X) to MWSS

Our plan to solve marginal inference:
1) reduce summation to a continuous minimization problem
2) discretize the continuous minimization on a mesh
3) find the optimal discrete solution using MWSS
Loosely speaking, given graphical model M, construct a nand Markov random field N where each node is a setting of a marginal. Rather than 1 node per configuration of ψc, enumerate all possible marginals on ψc that are within ǫ of each other. Then connect pairwise inconsistent nodes.

Figure: NMRF over discretized marginal settings for a two-variable model, with nodes such as [1 0], [0 1], [.5 0], [0 .5], [.5 .5].

Reduce marginal inference p(xi) = Σ_{X\xi} p(X) to MWSS

Marginal inference involves large summation problems like

p(xi) = (1/Z) Σ_{X\xi} exp( Σ_{c∈C} ψc(xc) )

Finding p(xi ) is equivalent to computing partition function Z Minimize the Gibbs free energy over all possible distributions q

−log Z = min_{q∈M} F_G = −max_{q∈M} ( Eq[ Σ_{c∈C} ψc(xc) ] + S(q(x)) )

Approximating marginals with the Bethe free energy

M → L

Bethe (1935) gave an alternative to minimizing the Gibbs free energy: find the partition function as the minimum of the Bethe free energy¹ over the local polytope L rather than the marginal polytope M

−log Z = min_{q∈M} F_G = −max_{q∈M} ( Eq[ Σ_{c∈C} ψc(xc) ] + S(q(x)) )
−log ZB = min_{q∈L} F_B = −max_{q∈L} ( Eq[ Σ_{c∈C} ψc(xc) ] + SB(q(x)) ),  log ZB ≈ log Z

In many cases, the Bethe partition function ZB bounds the true Z.

¹ The Bethe entropy is SB = Σ_{(i,j)∈E} Sij + Σ_{i∈V} (1 − di) Si.

Approximating marginals with the Bethe free energy

Remarkable result: [YFW01] showed that any fixed point of loopy belief propagation (LBP) corresponds to a stationary point of the Bethe free energy F
But LBP can fail to converge or may converge to bad stationary points
No previous method could find the global Bethe solution
We will derive the first polynomial time approximation scheme (PTAS) that finds the global optimum of the Bethe free energy F to within ǫ accuracy [WJ 2013, WJ 2014] for attractive models

The PTAS recovers the Bethe partition function ZB and the corresponding optimal marginal probabilities q(x)

Approximating marginals with the Bethe free energy

We will recover the distribution q(x) that minimizes F.
It is defined by the following:

Singleton marginals qi for all vertices i ∈ V(G)
Pairwise marginals µij for all edges (i, j) ∈ E(G)

qi = p(Xi = 1)

µij = [ p(Xi = 0, Xj = 0)   p(Xi = 0, Xj = 1)
        p(Xi = 1, Xj = 0)   p(Xi = 1, Xj = 1) ]
    = [ 1 + ξij − qi − qj   qj − ξij
        qi − ξij            ξij ]

Fortunately, minimizing F over ξij is analytic via [WT01]
Only numerical optimization over (q1, ..., qn) ∈ [0, 1]^n remains

A mesh over Bethe pseudo-marginals

We discretize the space (q1, ..., qn) ∈ [0, 1]^n with a mesh M(ǫ) that is sufficiently fine that the discrete solution q̂ we obtain has F(q̂) ≤ min_q F(q) + ǫ

A mesh over Bethe pseudo-marginals

Given a model with n vertices, m edges, and max edge weight W
If the original model is attractive (submodular costs), then the discretized minimization problem is a perfect graph MWSS
Solve via graph cuts [SF06] in O((Σ_{i∈V} Ni)^3) where Ni is the number of discretized values in dimension i
Two ways to make the mesh M(ǫ) sufficiently fine:
Bounding the curvature of F [WJ13] achieves a slow polynomial
Bounding the gradients of F [WJ14] achieves O(n^3 m^3 W^3 / ǫ^3)
Both algorithms find an ǫ-close global solution for ZB

Bethe pseudo-marginals
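The mesh idea can be exercised on a single attractive edge, where the Bethe free energy is exact, so its grid minimum should approach −log Z. A sketch with made-up weights; ξij is gridded here rather than solved analytically as in [WT01]:

```python
import math

# Single attractive edge: p(x1, x2) ∝ exp(th1*x1 + th2*x2 + W*x1*x2), W >= 0.
th1, th2, W = 0.3, -0.2, 1.0

def bethe_F(q1, q2, xi):
    # Pairwise marginal from (q1, q2, xi), per the parameterization above.
    mu = [1 + xi - q1 - q2, q2 - xi, q1 - xi, xi]  # states 00, 01, 10, 11
    if min(mu) < 0:
        return float('inf')  # outside the local polytope
    # Both nodes have degree 1, so the Bethe entropy is just S12.
    S12 = -sum(p * math.log(p) for p in mu if p > 1e-12)
    energy = -(th1 * q1 + th2 * q2 + W * xi)
    return energy - S12

grid = [k / 50 for k in range(51)]  # mesh of width 0.02 in each dimension
bethe_min = min(bethe_F(q1, q2, xi)
                for q1 in grid for q2 in grid for xi in grid)

log_Z = math.log(sum(math.exp(th1 * a + th2 * b + W * a * b)
                     for a in (0, 1) for b in (0, 1)))
```

Because this model is a tree, min F equals −log Z exactly, and the mesh minimum lands within the grid's resolution of it; on loopy attractive models the same discretized problem is the perfect-graph MWSS described above.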

Figure: mesh over Bethe pseudo-marginals; left ǫ = 1, right ǫ = 0.1, with fixed W = 5, n = 10

Marginal inference for attractive ranking

Figure: AUC as a function of b (horizontal, −20 to 2) and dv (vertical, 0.2 to 2); AUC ranges from 0.6 to 0.625.

Electric transformers network x1, ..., xn where xi ∈ {fail, stable}
Rank transformers by marginal probability of failure p(xk) via
p(x1, ..., xn) = (1/Z) exp( Σ_{ij∈E} ψij(xi, xj) + Σ_{k=1}^n ψ(xk) )
Each has known probability exp ψ(xk) of failing in isolation
Attractive edges between transformers couple their failures: ψ(xi, xj) = [α β; β γ] with α + γ ≥ 2β
The PTAS improves AUC to 0.625 from the independent ranking's 0.59

Marginal inference for frustrated ranking

Epinions users network x1, ..., xn where xi ∈ {suspect, trusted}
Rank users' trustworthiness using the marginal p(xk) from
p(x1, ..., xn) = (1/Z) exp( Σ_{e∈E} ψe(xi, xj) )
Attractive edges (red): ψ(0,0) + ψ(1,1) ≥ ψ(0,1) + ψ(1,0)
Repulsive edges (blue): ψ(0,0) + ψ(1,1) ≤ ψ(0,1) + ψ(1,0)
Can we use the PTAS on this frustrated graphical model??

Marginal inference for frustrated ranking

Given a frustrated graph G, we form the attractive double-cover G̃:
FOR each i ∈ V(G), create two copies denoted i1 and i2 in V(G̃)
FOR each edge (i, j) ∈ E(G)
    IF ψij is log-supermodular: add edges (i1, j1) and (i2, j2) to E(G̃)
    ELSE: add edges (i1, j2) and (i2, j1) to E(G̃)
Flip nodes on one side of the double-cover

Marginal inference for frustrated ranking
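The double-cover construction above is a few lines of code. A sketch (sign +1 marks a log-supermodular/attractive edge, −1 a repulsive one); on a frustrated triangle the cover unrolls into a single 6-cycle:

```python
def double_cover(signed_edges):
    # Two copies (i, 0) and (i, 1) per variable; attractive edges stay
    # within a copy, repulsive edges cross between the copies.
    edges = []
    for i, j, s in signed_edges:
        if s > 0:
            edges += [((i, 0), (j, 0)), ((i, 1), (j, 1))]
        else:
            edges += [((i, 0), (j, 1)), ((i, 1), (j, 0))]
    return edges

# Frustrated triangle: two attractive edges and one repulsive edge.
cover = double_cover([(0, 1, 1), (1, 2, 1), (0, 2, -1)])
```

Every frustrated cycle of odd sign unrolls to a cycle of twice the length in the cover, which is why the covered model can be treated as attractive.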

We prove that our PTAS on this double-cover gives ẐB ≥ ZB
Nodes shaded with p(xi = 1) to reflect trustworthiness
Loopy belief propagation is convergent on the double-cover

Conclusions

Goal: perform inference on large networks
Approach: set up tasks as finding maxima and marginals of a probability distribution p(x1, ..., xn)
Limitation: for big p(x1, ..., xn) these are intractable
Methodology: graphical modeling and efficient solvers
Verification: perfect graph theory and bounds
Efficient MAP on
    Bipartite matching models
    Attractive models
    Slightly frustrated models (new)
Efficient Bethe marginals on
    Attractive models (new)
    Frustrated models (new)

References

B. Huang and T. Jebara. "Loopy Belief Propagation for Bipartite Maximum Weight b-Matching". AISTATS, 2007.
T. Jebara. "MAP Estimation, Message Passing, and Perfect Graphs". UAI, 2009.
T. Jebara. "Perfect Graphs and Graphical Modeling". Tractability: Practical Approaches to Hard Problems, 2013.
A. Weller and T. Jebara. "Bethe Bounds and Approximating the Global Optimum". AISTATS, 2013.
A. Weller and T. Jebara. "On MAP Inference by MWSS on Perfect Graphs". UAI, 2013.
K. Tang, A. Weller and T. Jebara. "Network Ranking with Bethe Pseudomarginals". NIPS Workshop, 2013.
A. Weller and T. Jebara. "Approximating the Bethe Partition Function". UAI, 2014.
N. Ruozzi and T. Jebara. "Making Pairwise Binary Graphical Models Attractive". NIPS, 2014.