HOP-MAP: Efficient Message Passing with High Order Potentials

Daniel Tarlow (Dept. of Computer Science, University of Toronto), Inmar E. Givoni (Probabilistic & Statistical Inference Group, University of Toronto), Richard S. Zemel (Dept. of Computer Science, University of Toronto)

Abstract

There is a growing interest in building probabilistic models with high order potentials (HOPs), or interactions, among discrete variables. Message passing inference in such models generally takes time exponential in the size of the interaction, but in some cases maximum a posteriori (MAP) inference can be carried out efficiently. We build upon such results, introducing two new classes, including composite HOPs that allow us to flexibly combine tractable HOPs using simple logical switching rules. We present efficient message update algorithms for the new HOPs, and we improve upon the efficiency of message updates for a general class of existing HOPs. Importantly, we present both new and existing HOPs in a common representation; performing inference with any combination of these HOPs requires no change of representations or new derivations.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

1 INTRODUCTION

Probabilistic graphical models are powerful tools due to their representational power, and also due to general purpose algorithms that can be applied to any (low order) graphical model. For a broad range of problems, we can formulate a model in terms of graph structures and standard potentials. Then, without further derivation, we can automatically perform inference in the model. In particular, when the aim is to find a most likely configuration of variables (MAP), a range of efficient message passing algorithms can be applied (Wainwright et al., 2005; Werner, 2008; Globerson & Jaakkola, 2008).

When potentials begin to range over a large subset of variables, however, these methods quickly break down: for the general problem, message updates from a high order clique take time exponential in the size of the clique. One approach in such cases is to transform the problem into a pairwise problem by adding auxiliary variables. In the worst case, this will increase the problem size exponentially. Rother et al. (2009) give transformations that are practical for the special case of sparse potentials. Alternatively, special-purpose computations can sometimes be performed directly on the original high order representation, typically a factor graph (Givoni & Frey, 2009), which is the approach we take in this paper. This strategy applies beyond sparse potentials and is applicable for a wide range of special potential structures. Unfortunately, for a given potential it is typically not immediately clear whether it has tractable structure. Even if it does, message updates are case-specific and must be specially derived.

Our goal is to be able to use a broad range of high order potentials (HOPs) generically within MAP message passing algorithms as easily as we use low order tabular potentials. We see three issues that must be addressed:

1. Message updates need to be computed efficiently even when factors contain very large scopes.

2. It should be easy to recognize when problems contain tractable high order structure.

3. HOP constructions should be flexible and reusable, not requiring new problem-specific derivations and implementations on each use.

In Section 3, we describe the classes of atomic, building block HOPs that we consider, cardinality-based and order-based. Cardinality-based potentials have been used in several related works, and efficient message computations exist for both the general case and several restricted classes. Our first contribution, however, is showing that the efficiency can be improved even beyond existing efficient computations (Potetz, 2007; Tarlow et al., 2008). Our algorithm computes all messages in O(N log N) time, a factor of log N better than existing approaches. For the novel class of order-based potentials, we present equally efficient algorithms. For the atomic potentials we consider, finding the optimal assignment takes O(N) or O(N log N) time. We show that our algorithms can compute all N outgoing messages in the same asymptotic time.

Next, analogous in spirit to disciplined convex programming, which allows many of the manipulations and transformations required to analyze and solve convex programs to be automated (Grant et al., 2006), we introduce two types of composite HOPs, which allow us to build more complex HOPs by applying composition rules based on maximization or simple logical switches. A complex HOP can be recognized to be tractable by decomposing it into atomic tractable units combined with allowed composition rules. Importantly, once expressed as a composition, the message updates of composite HOPs can automatically be computed from the message computations of the atomic HOPs.

Our final, more subtle contribution is the particular binary representation, message normalization scheme, and caching strategy for computing all outgoing messages from a factor at once. This "framework," which we refer to throughout, is not novel, but it is also not the standard. Each part has a purpose: the binary representation exposes more structure in potentials; the message normalization scheme yields simpler derivations; and the caching strategy leads to more efficient algorithms. It should be straightforward to apply this strategy to other structured potentials to build new atomic building block HOPs.

Section 4 shows that well-known and novel graph constructions are easily expressible using the vocabulary of potentials discussed, and that once a model is expressed in this framework, it can be used in a variety of MAP inference procedures. In Section 5 we present experimental results that illustrate the ease of constructing models of interest with this formulation.

2 REPRESENTATION

We work with a factor graph representation, which is a bipartite graph consisting of variable nodes, h = {h_1, ..., h_n}, and factor nodes. Let N(h_j) be the neighbors of variable h_j in the graph. Factors, or potentials, θ = {θ_1, ..., θ_n, θ_{n+1}, ..., θ_{n+k}}, define interactions over individual variables h_j and subsets of variables C = {c_1, ..., c_k}, c ⊆ {h_1, ..., h_n}, which are exactly the factor's neighbors in the factor graph. With slight abuse of notation, we use θ_j to represent node potentials over single variables, and θ_c to represent HOPs over subsets. θ_j : h_j ∈ {0, 1} → R and θ_c : h_c ∈ {0, 1}^{|c|} → R assign a real value to a variable or subset assignment, respectively. The potentials we present can range over any number of variables, so we use N to generically represent |c|.

A factor graph, then, defines a (log) likelihood that takes the form

L(h) = Σ_{j=1}^{n} θ_j(h_j) + Σ_{c∈C} θ_c(h_c).   (1)

The MAP inference problem, to find a setting of h that maximizes the likelihood, h^OPT = arg max_h L(h), is NP-hard for general loopy graphs, and we typically must resort to approximate optimization methods.

2.1 MAP MESSAGE PASSING

Max-product belief propagation (MPBP) is an iterative, local, message passing algorithm that can be used to find the MAP configuration of a probability distribution specified by a tree-structured graphical model. When working in log space, the algorithm is known as max-sum, and the updates involve sending messages from factors to variables,

m̃_{θ_c→h_j}(h_j) = max_{h_c \ {h_j}} [ θ_c(h_c) + Σ_{j'∈c | j'≠j} m̃_{h_{j'}→θ_c}(h_{j'}) ],

and from variables to factors, m̃_{h_j→θ_c}(h_j) = Σ_{c'∈N(h_j)\c} m̃_{θ_{c'}→h_j}(h_j). After a forward and backward pass sending messages to and from a root node, optimal assignments can be decoded from beliefs, h_j^OPT = arg max_{h_j} b(h_j), where b(h_j) = Σ_{c'∈N(h_j)} m̃_{θ_{c'}→h_j}(h_j). In tree-structured graphs, beliefs defined in this way give (up to a constant) max-marginals: Φ_{j;a} = max_{h | h_j = a} L(h). In loopy graphs, beliefs produce pseudo max-marginals, which do not account for the loopy graph structure but can be decoded to give approximate solutions, which have been shown to be useful in practice.

2.2 BINARY REPRESENTATION

Note that we have restricted our attention to binary variable problems. To represent multinomial variables, we apply a simple transformation, converting variables with L states to binary variables with a 1-of-L constraint ensuring that exactly one variable in the set is on.

Since all variables are binary, we normalize messages so that the entry for h_j = 0 is always 0. To enforce this constraint, we subtract m̃_{θ_c→h_j}(0) from both coordinates, giving us (0, m̃_{θ_c→h_j}(1) − m̃_{θ_c→h_j}(0)).
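To make these conventions concrete, here is a small brute-force sketch of ours (not code from the paper) of the factor-to-variable max-sum update combined with the scalar-difference normalization just described. It is exponential in the factor size, which is exactly the cost that the structured computations of Section 3 avoid:

```python
import itertools

def max_sum_message(theta, in_msgs, j):
    """Brute-force max-sum message from a tabular factor to binary variable j,
    returned as a scalar difference (the h_j = 0 entry normalized to 0).

    theta: dict mapping joint assignments (tuples of 0/1) to log-potential values.
    in_msgs: incoming scalar-difference messages, one per variable in the
             factor's scope (the entry for j itself is ignored).
    """
    n = len(in_msgs)
    best = {0: float("-inf"), 1: float("-inf")}
    for h in itertools.product((0, 1), repeat=n):
        # Scalar-difference messages contribute only for variables that are on.
        val = theta[h] + sum(m for i, m in enumerate(in_msgs) if i != j and h[i])
        best[h[j]] = max(best[h[j]], val)
    return best[1] - best[0]

# Example: a soft "all on" factor over 3 variables (0 if all on, -1 otherwise).
theta = {h: (0.0 if all(h) else -1.0) for h in itertools.product((0, 1), repeat=3)}
msg = max_sum_message(theta, [0.4, -0.2, 0.3], j=0)
```

Because this example factor rewards the all-on pattern, the returned difference for h_0 is positive, encoding a preference for h_0 = 1.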


We can then drop the argument for m̃_{θ_c→h_j}(1) and use m_{θ_c→h_j} to represent the scalar message difference. Similarly, we assume θ_j(0) = 0 for all node potentials by setting θ̃_j = θ̃_j(1) = θ_j(1) − θ_j(0).

We are then working with message differences, where a positive value indicates a variable's preference to be on, and a negative value indicates a variable's preference to be off. To recover a vector of properly scaled messages from message differences, we need the value of the optimal assignment relative to the potential of interest and the incoming messages,

M_j^{θ_c} = max_{h_c} [ θ_c(h_c) + Σ_{j'∈c | j'≠j} h_{j'} · m_{h_{j'}→θ_c} ].

Define M^{θ_c} to be the maximum value when no j is left out of the sum. The correctly scaled message vector for h_j is then

( M^{θ_c} − max(0, m_{θ_c→h_j}), M^{θ_c} + min(0, m_{θ_c→h_j}) ).

Properly scaled max-marginals for the star surrounding a factor can then be seen as beliefs: Φ_j^{θ_c} = M_j^{θ_c} + (0, θ̃_j). We work in the scalar message difference representation, but this shows that we can convert to and from a vector representation if needed.

If we have a special purpose procedure for computing max-marginals over multinomial variables, as in Duchi et al. (2007), we can convert to a binary representation at the max-marginal level as well. Starting with max-marginals for a multinomial variable, c, which can take on one of M values, j ∈ {1, ..., M}, define binary variables h_j = 1 ⟺ c = j. A max-marginal for variable h_j = 1 is exactly the max-marginal for c = j. For h_j = 0, we can take the maximum max-marginal for c = j' | j' ≠ j.

3 TRACTABLE HOPS

We now turn our attention to classes of HOP where max-marginals can be computed efficiently. The message computations require careful caching and bookkeeping; however, the efficient implementations as well as detailed derivations of the updates will be made publicly available at http://www.cs.toronto.edu/~dtarlow/hops/.

3.1 CARDINALITY-BASED POTENTIALS

A broad class of existing tractable HOPs is where function values are based on the number of variables in the subset that are on:

θ_card(h_1, ..., h_N) = f(Σ_j h_j).

Gupta et al. (2007) show that the optimal single assignment can be computed for a star-structured model (i.e., one containing a single HOP and arbitrary unary potentials) over cardinality in O(N log N) time by a simple sorting then greedy procedure. We could compute max-marginals for the star graph naively in O(N^2 log N) time by running this procedure iteratively, fixing at each iteration one value of one variable. If we use MPBP, Potetz (2007) shows that a single message can be approximately computed in O(N) time, so approximate max-marginals could be computed in O(N^2) time. If we compute all outgoing messages at the same time, Tarlow et al. (2008) show that all N outgoing messages can be computed exactly in O(N^2) time.

Here, we give a procedure for exactly computing all outgoing messages in O(N log N), the same time it would take to compute the optimal assignment using the Gupta et al. (2007) procedure. We note that by using a dynamic heap and reusing previous sorts, the complexity may in reality be closer to O(N) total (O(1) amortized per message). Also note that this is in comparison to the O(N · 2^{N−1}) time that it would take to compute these messages if the potential were represented in standard tabular form.

First, as usual, sort all incoming messages in descending order, yielding m_{h_{j*_b}→θ}, where j*_b is the index of the incoming message with the bth largest value (and define r(j) as the rank of variable j, so that r(j*_b) = b). Next, in a linear pass, compute the cumulative sums

c_{−1}(b) = [Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ}] + f(b − 1);
c_0(b) = [Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ}] + f(b);
c_1(b) = [Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ}] + f(b + 1),

where Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ} is the sum of the b largest incoming messages (0 when b = 0); c_0 is defined for b ∈ {0, ..., N}, c_{−1} for b ≥ 1, and c_1 for b ≤ N − 1. In another linear pass, compute the cumulative maxes of the cumulative sums from the left, the right, or both:

s_1^L(b) = max_{b'∈{0,...,b}} c_1(b');
s_{−1}^R(b) = max_{b'∈{b,...,N}} c_{−1}(b');
s_0^L(b) = max_{b'∈{0,...,b}} c_0(b');
s_0^R(b) = max_{b'∈{b,...,N}} c_0(b').

The outgoing messages are then m_{θ→h_j} = m̃_{θ→h_j}(1) − m̃_{θ→h_j}(0), where

m̃_{θ→h_j}(0) = max( s_0^L(r(j) − 1), s_{−1}^R(r(j) + 1) − m_{h_j→θ} ),

m̃_{θ→h_j}(1) = max( s_1^L(r(j) − 1), s_0^R(r(j) + 1) − m_{h_j→θ} ).

Intuitively, this is simply a dynamic programming procedure for computing maximizations over cumulative sums, which can be done in a few linear passes. Messages are then array lookups, so the complexity of computing all N outgoing messages is dominated by the initial sort operation, which will be O(N log N) in the worst case.
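The whole procedure (sort, prefix sums, cumulative maxes, then constant-time lookups per message) can be sketched as follows. This is our own illustration, with prefix sums S(b) of the sorted messages playing the role of the cumulative sums above and 0-indexed ranks:

```python
import numpy as np

def cardinality_messages(f, m_in):
    """All N outgoing scalar-difference messages from theta(h) = f(sum_j h_j),
    given incoming scalar-difference messages m_in: one sort, then a few
    linear passes of cumulative sums and cumulative maxes."""
    m_in = np.asarray(m_in, dtype=float)
    N = len(m_in)
    order = np.argsort(-m_in)            # variables sorted by message, descending
    rank = np.empty(N, dtype=int)
    rank[order] = np.arange(N)           # rank[j] = 0-indexed sorted position of j
    S = np.concatenate(([0.0], np.cumsum(m_in[order])))  # S[b] = sum of b largest
    fk = np.array([f(k) for k in range(N + 1)], dtype=float)
    A = S + fk                           # A[b] = S(b) + f(b)    (c_0)
    B = S[1:] + fk[:-1]                  # B[b] = S(b+1) + f(b)  (c_-1, shifted)
    C = S[:-1] + fk[1:]                  # C[b] = S(b) + f(b+1)  (c_1)
    LA = np.maximum.accumulate(A)                  # prefix maxes
    LC = np.maximum.accumulate(C)
    RB = np.maximum.accumulate(B[::-1])[::-1]      # suffix maxes
    RA = np.maximum.accumulate(A[::-1])[::-1]
    out = np.empty(N)
    for j in range(N):
        r = rank[j]
        # Counts where j sits inside the top block: remove its own message.
        m0 = RB[r] - m_in[j]
        m1 = RA[r + 1] - m_in[j]
        # Counts where j sits outside the top block: plain prefix maxes.
        if r > 0:
            m0 = max(m0, LA[r - 1])
            m1 = max(m1, LC[r - 1])
        out[j] = m1 - m0
    return out

# Example: a soft 1-of-N potential, f(k) = 0 if exactly one variable is on, -1 otherwise.
msgs = cardinality_messages(lambda k: 0.0 if k == 1 else -1.0, [0.5, -0.5])
```

On this example the two messages come out as 0.5 and −0.5: the factor pushes the stronger variable on and the weaker one off. For N = 1 the result reduces to f(1) − f(0), independent of the incoming message.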


3.1.1 Special Cardinality Potentials

A simple special case of cardinality potentials is the potential that takes on value 0 if a set of variables are all on (i.e., Σ_j h_j = N) and a value of −α otherwise. When α > 0, this encourages sets of variables to match specific patterns, so it has been referred to as a pattern-based potential (Komodakis & Paragios, 2009). By breaking the computation into separate cases for Σ_j h_j = N and Σ_j h_j ≠ N, it is straightforward to show that the messages take the form

m_{θ_on→h_j} = max( α + Σ_{j':j'≠j} min(0, m_{h_{j'}→θ_on}), 0 ).

Due to this more restricted structure, all outgoing messages from a pattern-based potential can be computed in O(N) (O(1) amortized) time by computing the full sum, Σ_{j'} min(0, m_{h_{j'}→θ_on}), and then subtracting min(0, m_{h_j→θ_on}) for each individual message.

Another special case of cardinality potentials is the potential that constrains exactly b variables in a set to be on. Givoni and Frey (2009) give updates that compute all messages from a single such potential in O(N) time, showing that additional structure allows us to compute messages more efficiently than in the general case.

3.2 ORDER-BASED POTENTIALS

We now introduce examples from a new class of tractable HOP. These potentials depend on the ordering of variables within a factor, and they arise when trying to represent spatial, temporal, or other ordering considerations.

3.2.1 Convex Set Potentials

A convex set in one dimension is a contiguous subset. We define a convex set potential, θ_cvx, to be a potential that requires the variables with value 1 in an ordered set to form a convex set. Before deriving message updates, we must develop a notion of maximum weight contiguous sequences and an algorithm for efficiently finding them.

Suppose we have a vector of real-valued weights w = (w_1, ..., w_N). We define the maximum weight contiguous sequence problem as max_h w · h such that the h_j ∈ h with value 1 form a convex set. Define MS(range, constraint) as the value of the maximum weight contiguous subsequence in the given range obeying the constraint (or unconstrained if the constraint is left out). For notational simplicity, if MS(range, h_j = 1) < 0, let the value be 0 instead. The scalar message m_{θ_cvx→h_j} is

MS(h_{1:j−1}, h_{j−1} = 1) + MS(h_{j+1:N}, h_{j+1} = 1) − max( MS(h_{1:j−1}), MS(h_{j+1:N}) ).

We can compute the constrained and unconstrained maximum weight contiguous subsequences we need for all outgoing messages in linear time using dynamic programming. Given this, computing a message requires only four array lookups, so the total time to compute all messages is O(N) (O(1) amortized).

3.2.2 Before-After Potentials

We define a before-after potential over two ordered subsets of variables that encourages one to come before the other:

θ_→(h_x, h_y) =
  −∞ if (Σ_{i∈x} h_i) ≠ 1 ∨ (Σ_{j∈y} h_j) ≠ 1,
  0 if i < j | i ∈ x, j ∈ y, h_i = h_j = 1,
  −α otherwise.

The key to computing messages for this factor is to condition on the hard constraint being satisfied, then split into the cases i > j and i < j. We compute maximums independently, then take the maximum over cases, where α is subtracted from the case i > j.

As with the cardinality potentials, we take cumulative maxes over the incoming x messages and incoming y messages from both the left and the right, respectively: s_x^L(k), s_x^R(k), s_y^L(k), s_y^R(k) for k ∈ {1, ..., N}. Ignoring edge cases and only dealing with messages to h_i | i ∈ x for simplicity, the basic form of the messages is

m̃_{θ_→→h_i}(1) = max( s_y^L(i − 1) − α, s_y^R(i + 1) ),

and m̃_{θ_→→h_i}(0) = M^{θ_→} for all cases except messages to h_{i*}, where

M^{θ_→} = max( max_k [ s_x^L(k) + s_y^R(k + 1) ], max_k [ s_y^L(k) + s_x^R(k + 1) ] − α )

for k ∈ {1, ..., N − 1}, and i* is the choice of i used in the optimal assignment of i and j. For messages to h_{i*}, we need the maximum setting of i and j where i ≠ i*; this can easily be computed in another linear pass. As usual, m_{θ_→→h_i} = m̃_{θ_→→h_i}(1) − m̃_{θ_→→h_i}(0). Updates to h_j ∈ y are almost identical but reversed.

This computation leverages the same caching ideas as the general cardinality computations, which should be broadly applicable across other classes of HOP as well. Again, after the proper linear processing, the messages we need to send are just array lookups, so the total complexity is O(N) to compute all messages, or O(1) per message amortized.

3.3 COMPOSITION OF POTENTIALS

In many cases, we may wish to switch between HOPs based on some simple logical rules. Typically, this would require deriving new message updates, even for small changes. To address this problem, we present the class of composite potentials, which can be viewed as an extension of context-specific independencies, where rules specify tractable HOPs instead of constant values (Boutilier et al., 1996).
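The MS quantities used by the convex set messages above can be computed with Kadane-style prefix and suffix arrays. The following is our own linear-time sketch, assuming the hard contiguity potential permits the empty set:

```python
def convex_set_messages(m_in):
    """All outgoing scalar-difference messages from a hard convex-set
    (contiguity) potential, given incoming scalar-difference messages,
    via linear-time Kadane-style prefix/suffix dynamic programming."""
    N = len(m_in)
    NEG = float("-inf")
    # end_at[p] / start_at[p]: best weight of a contiguous run that ends /
    # starts exactly at position p (extension beyond p is optional).
    end_at, start_at = [NEG] * N, [NEG] * N
    for p in range(N):
        end_at[p] = m_in[p] + max(0.0, end_at[p - 1] if p > 0 else 0.0)
    for p in range(N - 1, -1, -1):
        start_at[p] = m_in[p] + max(0.0, start_at[p + 1] if p < N - 1 else 0.0)
    # best_left[p]: best contiguous run (possibly empty, value 0) within 0..p-1;
    # best_right[p]: the same within p..N-1.
    best_left, best_right = [0.0] * (N + 1), [0.0] * (N + 1)
    for p in range(N):
        best_left[p + 1] = max(best_left[p], end_at[p])
    for p in range(N - 1, -1, -1):
        best_right[p] = max(best_right[p + 1], start_at[p])
    out = []
    for j in range(N):
        left = max(0.0, end_at[j - 1]) if j > 0 else 0.0        # optional run ending at j-1
        right = max(0.0, start_at[j + 1]) if j < N - 1 else 0.0  # optional run starting at j+1
        m1 = left + right                        # best run through j, excluding m_in[j]
        m0 = max(best_left[j], best_right[j + 1])  # best run avoiding j (or empty)
        out.append(m1 - m0)
    return out
```

The clamping at 0 in `left` and `right` mirrors the convention above that a constrained MS value below 0 is replaced by 0: extending the run past j in either direction is optional.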


Suppose we have two disjoint subsets of h_c, h_s ∪ h_p = h_c and h_s ∩ h_p = ∅. Define a composite potential

θ_φ(h_c) = θ_{g(h_s)}(h_p),

where g : {0, 1}^{|h_s|} → {0, ..., K − 1} assigns each setting of h_s to one of K partitions, and there is a different active potential θ_{g(h_s)} depending on the partition. If we condition on g(h_s) = k, the maximizations we need decouple into

max_{h_s | g(h_s)=k} [ Σ_{i'} m̃_{h_{i'}→θ_φ}(h_{i'}) ] + max_{h_p \ h_j} [ θ_k(h_p) + Σ_{j'≠j} m̃_{h_{j'}→θ_φ}(h_{j'}) ].

In most practical cases, |h_s| is small (and as it grows, the potential becomes an intractable non-structured HOP), so we do the first maximization by enumerating all possible values of h_s, yielding M^{θ_s}(k), the maximum of the first term conditioned on g(h_s) = k.

We use efficient message computations to compute properly scaled max-marginals for the subset of variables in h_p relative to θ_k; this is just computing max-marginals for the star graph around θ_k, ignoring h_s. We know how to do this efficiently because θ_k is assumed to be a tractable HOP. We define these values to be M_{j;h_j}^{θ_k}(k) for potential k and h_j fixed to take on value 0 or 1. Finally, compute

m̃_{θ_φ→h_j}(0) = max_k [ M^{θ_s}(k) + M_{j;0}^{θ_k}(k) ],
m̃_{θ_φ→h_j}(1) = max_k [ M^{θ_s}(k) + M_{j;1}^{θ_k}(k) ],

by evaluating all combinations of k ∈ {0, ..., K − 1} and h_j ∈ {0, 1}, which requires computing all outgoing messages from K tractable HOPs, yielding an amortized message cost of O(2^{|h_s|} + K) or O(2^{|h_s|} + K log N) depending on the type of HOP. Note that the optimal value of k may be different for each maximization. The messages to variables in h_s are similar, but we omit them due to space.

A related composite factor is one where there is no preference for h_s to take any particular value (i.e., for all h ∈ h_s, h is only in the scope of one composite factor). In this case, the HOPs are switched between based only on which is most likely, and since messages to h_s would only be useful to infer which potential was chosen, we can choose not to compute them if we are only interested in the assignment of the h_p variables. This is a max-composition potential, and the truncated linear deviation pattern potentials of Rother et al. (2009) can be seen as a special case of this composition.

The binary formulation of affinity propagation (Givoni & Frey, 2009), for example, can be thought of as using a composite potential for each column j, where h_s = {h_jj}, h_p = {h_ij | i ≠ j}, g(0) = 0, g(1) = 1, θ_0 is an all-off potential, and θ_1 = 0.

4 USING HOPS

Graph Constructions

There are many existing and novel models that can be constructed by combining HOPs in different ways. Here we give a few examples of the range of models that can be constructed using the HOP vocabulary.

Pattern potentials are becoming popular in several computer vision applications. b-of-N potentials can be used: in general b-matching problems; to represent the set cover problem in conjunction with a composite potential that enforces an all-on or all-off constraint; and as hard degree constraints on nodes in a structure learning framework.

Cardinality potentials are useful in many cases beyond b-of-N potentials or pattern potentials, though. Hard or soft parity potentials can be expressed by setting f(k) = 0 when k is even and f(k) = −α when k is odd. They can further be used to express nonparametric priors over cluster sizes in an exemplar-based clustering setting; they can encourage a specific percentage of pixels to be on in an image segmentation setting; and they can be used to represent a learned distribution of sizes for empirical image segmentation priors.

Before-after potentials are useful for representing user preferences of document i over document j in a ranking setting; temporal ordering information in a time-warped matching setting; soft word ordering preferences in a language parsing setting; or the soft temporal ordering of subactivities in an activity recognition setting.

Convex set potentials are useful for representing constraints that parts be convex in an image segmentation setting; that words be contiguous in a language parsing setting; or that objects appear in a contiguous region of time in a recognition-based tracking setting.

Importantly, because the underlying mechanism is message passing on standard factor graphs, any combination of these potentials or composed variations can readily be combined and seamlessly integrated with standard local potentials.

Outer Loop Algorithms

Our contribution in this work lies in the inner loop of inference: in the subroutines used to calculate messages in the message passing framework. We emphasize that several "outer loop" inference routines can take advantage of these message-computation subroutines. Most obviously, MPBP defined on a cluster graph with single variable separator sets can use the presented message computations


without modification, though schedules that compute all outgoing messages from a high order factor at once will be most efficient. Globerson and Jaakkola (2008) give a generalized edge-based variant of max-product linear programming (GEMPLP) that can use standard HOP computations but requires that (easily calculated) backwards messages be added to the output. Komodakis and Paragios (2009) mention a high order variant of tree-reweighted max-product (TRW) (Wainwright et al., 2005) that can use these computations to directly compute the needed star graph max-marginals.

5 EXPERIMENTS

We have explored HOP constructions for several problems in combinatorial optimization, ranking, and vision. However, we leave a thorough exploration of these applications and comparison of outer-loop MAP algorithms to future work. Here, we focus on an image segmentation and a ranking example, demonstrating that high order models can be constructed easily and used within a variety of outer loop algorithms.

Image Segmentation

We generated 64 x 64 pixel synthetic data for a foreground-background image segmentation task that simulates a fairly strong foreground response in a square region of the image, then a semi-occluded region (horizontal bar) where there is little distinction from the background. We added uniform pairwise potentials to the 4-neighborhood of each pixel, and we experimented with different combinations of HOPs: no extra potentials; convex set potentials along horizontal, vertical, 45°, and 135° diagonals; cardinality potentials f(k) = −|64²/4 − k|; and a combination of both cardinality and convex set potentials.

We run asynchronous belief propagation, using a typical grid schedule, where we pass messages down then up along columns, to neighboring columns, then moving across the image and repeating for the next column. We follow this with a similar scheme for rows. After one round of these messages, we iterate through all HOPs and update all messages going into the current high order factor, then all messages outgoing from the HOP. We use damping of .5, and run until messages converge.

Figure 1: Image segmentation results for a 64 x 64 square using asynchronous max-product belief propagation and combinations of high order cardinality and convexity potentials. Rows from top to bottom: only unary and pairwise; unary, pairwise, and convexity; unary, pairwise, and cardinality; unary, pairwise, convexity, and cardinality. Columns from left to right: pairwise strength 0, .05, .1, .5.

Fig. 1 shows decoded beliefs for a variety of pairwise potential strengths and HOP combinations. Note that there is no setting of pairwise potentials that identifies the region as one square (top row). However, when we add the HOPs, it becomes easy to express priors that encourage the single square interpretation. Additionally, the algorithm is reasoning in a non-trivial way, especially in the cardinality case as pairwise potentials get stronger (third row). When pairwise potentials are weak (left columns), the algorithm chooses the pixels with highest singleton potential to turn on. As the pairwise potentials get stronger (right columns), it must balance singleton potentials with spatial contiguity.

Rank Aggregation

When ranking documents for information retrieval, it is common to use a set of document- and query-specific features to learn a score for each document-query combination. When eliciting preferences from users, however, it is difficult for them to produce a real-valued score for a query-document pair. Instead, users typically can express pairwise preferences (i.e., they prefer document i to document j for some query).

We simulate this setting by building a query-specific model where N documents can take on one of N ranks, represented as an N x N binary grid of variables h_ij. Each document has a score, s_i ∼ Uniform(0, 1), and we use unary potentials of the form θ_ij(h_ij) = (N − j) · s_i to express that we prefer documents with higher scores to be ranked higher (closer to 1). We simulate user preferences by adding before-after potentials of random strength over 10% of randomly chosen pairs of rows. We also scale the overall strength of all before-after potentials by multiplying by a common factor, λ,


giving us θ_→(h_i:, h_k:; λ). This objective incorporates both document-specific scores and elicited user preferences. Note that if all document scores are zero, this can represent the NP-hard Kemeny rank aggregation problem (Bartholdi et al., 1989).

In addition, we add the full bipartite matching potential described in Duchi et al. (2007), enforcing the constraint that each document can only take on one rank, and each rank can only be occupied by one document. We do inference with the simple high-order generalization of TRW given in Komodakis and Paragios (2009). To decode assignments, we employ a greedy strategy that is guaranteed to enforce the uniqueness constraints: find the largest belief, b_ij, set the corresponding h_ij = 1, then set beliefs in row i and column j to −∞ and repeat until we have a full assignment. For comparison, we also compute the objective of an algorithm that ranks based solely on score. The dual objective always met the primal decoded objective, so we stopped inference when we were guaranteed to have found the global optimum. Fig. 2 shows results for several λ.

[Figure 2 panels: (a)-(c) show objective versus iteration curves for "Scores Only", "Dual Objective", and "Primal Decoding"; (d)-(f) show final assignments for λ = 0.01, 0.05, 0.25 with ρ = 13/18, 15/18, 18/18, respectively.]

Figure 2: (a - c) Primal and dual objective for ranking using high order TRW. λ gives the strength of before-after potentials representing binary user preferences of one document over another. (d - f) Final assignments found when λ is varied, all of which are globally optimal. Documents are rows, ranks are columns. Document-specific scores encourage document i to be ranked in position i. ρ gives the fraction of soft pairwise order constraints that are satisfied.

6 RELATED WORK

High Order Message Passing

In this work, we put into a single framework, and expanded upon, several recent works. See Fig. 3 for an overview. Givoni and Frey (2009) use efficient updates for various b-of-N potentials in a MPBP setting; Huang and Jebara (2007) develop an efficient MPBP algorithm for weighted bipartite b-matching and prove that the algorithm finds the optimal assignment; Duchi et al. (2007) show how combinatorial algorithms can efficiently compute messages in models that have a full bipartite matching or associative MRF as a subproblem; and Komodakis and Paragios (2009) show how to compute max-marginals for pattern-based potentials. Potetz (2007) and Tarlow et al. (2008) show how to compute messages for the general cardinality case, and Werner (2008) briefly sketches a simple algorithm for doing so, but none gives the more efficient algorithm we have presented here.

Other Decompositions

Dual decomposition (DD) (Bertsekas, 1999) is a framework for combining optimal subproblem solutions rather than max-marginals. Our potentials could easily be used as subproblems in this framework, but it would be simpler to directly compute subproblem assignments rather than doing so via max-marginals. DD has become popular very recently as an alternative to belief propagation, but few direct comparisons have been done on equivalent problem decompositions. Our formulation provides a means to perform a comparison of DD and message passing approaches, because it allows message passing algorithms to be defined on equivalent problem decompositions for many of the problems that DD has been applied to.

Finally, certain restricted types of HOP have also been used in a graph cuts framework (Kohli et al., 2007, 2009; Delong & Boykov, 2009). When a problem can be formulated to be submodular, these approaches will always find the global optimum. As models become more complex, however, it takes more effort to manipulate the model into a compliant form, often involving transformations, creation of auxiliary variables, and sometimes non-intuitive restrictions on the problem formulation. Our work, in contrast, applies uniformly to a broader range of problems, and it is intuitive to construct a model using the message passing formulation. However, we see discovering further connections and developing hybrid algorithms that work with ideas from both approaches as an open direction of work.

7 CONCLUSION

In many low order models where we can compute global optimums, the assignments still do not match ground truth. This indicates we need richer, possibly higher order representations. A challenge in working with HOPs is that it is often difficult to recognize when


a problem formulation will be tractable. Our vocabulary of common HOPs should make it easier to identify problem structures and formulate appropriate potentials that can be incorporated into efficient message passing MAP inference algorithms.

We emphasize that the HOPs and composition rules that we have presented are not exhaustive. There are likely to be many other classes of tractable HOPs and compositions. In these cases, the strategies presented here can likely be applied to build new classes of HOPs.

A limitation of our work as presented here is that factors only communicate with the rest of the network via single-variable interfaces. To get communication, for example, between planar faces in a generalized MPBP setting, and to get tighter LP relaxations, we would like our factors to communicate via separator sets of several variables. In theory, there is nothing preventing us from deriving such algorithms; this is an interesting direction for future work.

Figure 3: Combinations of potentials and inference algorithms in the literature. [Table: columns are the potential classes Special Cardinality (b of N; >/< b of N), Pattern, General Cardinality, General Order, and Composition; rows are the inference methods MPBP, covering [C], [Giv], [H], and [T]**; MPBP Variants, covering [Kom] and [W]**; and DD (OPT only), covering [Tor], [Kom], [Vic], and [Gup].] OPT only means the method does not use max-marginals. * Can be expressed using the presented HOPs and, to our knowledge, is novel. ** We improve upon the efficiency of these calculations. [C] (Cheng et al., 2006), [Giv] (Givoni & Frey, 2009), [Gup] (Gupta et al., 2007), [H] (Huang & Jebara, 2007), [Kom] (Komodakis & Paragios, 2009), [T] (Tarlow et al., 2008), [Tor] (Torresani et al., 2008), [W] (Werner, 2008), [Vic] (Vicente et al., 2009).

Acknowledgements

We thank Pushmeet Kohli and Victor Lempitsky for motivating application ideas and Amit Gruber for comments on an earlier version of the manuscript.

References

Bartholdi, J., Tovey, C. A., & Trick, M. A. (1989). Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6(3).

Bertsekas, D. (1999). Nonlinear Programming.

Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI) (pp. 115–123).

Cheng, Y.-S., Neely, M. J., & Chugg, K. M. (2006). Iterative message passing algorithm for bipartite maximum weighted matching. In Proc. of IEEE International Symposium on Information Theory.

Delong, A., & Boykov, Y. (2009). Globally optimal segmentation of multi-region objects. In International Conference on Computer Vision (ICCV).

Duchi, J., Tarlow, D., Elidan, G., & Koller, D. (2007). Using combinatorial optimization within max-product belief propagation. In Advances in Neural Information Processing Systems 19 (pp. 369–376). Cambridge, MA: MIT Press.

Givoni, I., & Frey, B. J. (2009). A binary variable model for affinity propagation. Neural Computation, 21(6), 1589–1600.

Globerson, A., & Jaakkola, T. (2008). Fixing max product: Convergent message passing algorithms for MAP LP-relaxations. In Neural Information Processing Systems.

Grant, M., Boyd, S., & Ye, Y. (2006). Global optimization: From theory to implementation (L. Liberti & N. Maculan, Eds.).

Gupta, R., Diwan, A., & Sarawagi, S. (2007). Efficient inference with cardinality-based clique potentials. In International Conference on Machine Learning (ICML) (Vol. 227, pp. 329–336).

Huang, B., & Jebara, T. (2007). Loopy belief propagation for bipartite maximum weight b-matching. In The Eleventh International Conference on Artificial Intelligence and Statistics.

Kohli, P., Kumar, M. P., & Torr, P. (2007). P3 and beyond: Solving energies with higher order cliques. In Computer Vision and Pattern Recognition (CVPR).

Kohli, P., Ladický, L., & Torr, P. H. (2009). Robust higher order potentials for enforcing label consistency. Int. J. Comput. Vision, 82(3), 302–324.

Komodakis, N., & Paragios, N. (2009). Beyond pairwise energies: Efficient optimization for higher-order MRFs. In Computer Vision and Pattern Recognition (CVPR).

Potetz, B. (2007). Efficient belief propagation for vision using linear constraint nodes. In Computer Vision and Pattern Recognition (CVPR). Minneapolis, MN, USA: IEEE Computer Society.

Rother, C., Kohli, P., Feng, W., & Jia, J. (2009). Minimizing sparse higher order energy functions of discrete variables. In Computer Vision and Pattern Recognition (CVPR).

Tarlow, D., Zemel, R., & Frey, B. (2008). Flexible priors for exemplar-based clustering. In Uncertainty in Artificial Intelligence.

Torresani, L., Kolmogorov, V., & Rother, C. (2008). Feature correspondence via graph matching: Models and global optimization. In European Conference on Computer Vision (ECCV).

Vicente, S., Kolmogorov, V., & Rother, C. (2009). Joint optimization of segmentation and appearance models. In International Conference on Computer Vision (ICCV).

Wainwright, M. J., Jaakkola, T., & Willsky, A. S. (2005). MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11), 3697–3717.

Werner, T. (2008). High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In Computer Vision and Pattern Recognition (CVPR).
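For concreteness, the greedy uniqueness-enforcing decoding strategy described in the ranking experiments can be sketched in a few lines. This is an illustrative sketch only; the function name, the use of NumPy, and the assumption of a square belief matrix are ours, not from the paper:

```python
import numpy as np

def greedy_decode(beliefs):
    """Greedily decode a bipartite assignment from beliefs b_ij:
    repeatedly pick the largest belief, set the corresponding
    h_ij = 1, then set beliefs in row i and column j to -inf so
    each document gets one rank and each rank one document."""
    b = np.array(beliefs, dtype=float)
    n = b.shape[0]  # assumes an n x n (documents x ranks) matrix
    assignment = np.zeros_like(b, dtype=int)
    for _ in range(n):
        # Largest remaining belief and its (row, column) position.
        i, j = np.unravel_index(np.argmax(b), b.shape)
        assignment[i, j] = 1
        b[i, :] = -np.inf  # document i is now ranked
        b[:, j] = -np.inf  # rank j is now occupied
    return assignment
```

By construction the result satisfies both uniqueness constraints (exactly one 1 per row and per column), which is why the decoded primal objective can be compared directly against the dual objective as a stopping criterion.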
