HOP-MAP: Efficient Message Passing with High Order Potentials

Daniel Tarlow (Dept. of Computer Science, University of Toronto), Inmar E. Givoni (Probabilistic & Statistical Inference Group, University of Toronto), Richard S. Zemel (Dept. of Computer Science, University of Toronto)

Abstract

There is a growing interest in building probabilistic models with high order potentials (HOPs), or interactions, among discrete variables. Message passing inference in such models generally takes time exponential in the size of the interaction, but in some cases maximum a posteriori (MAP) inference can be carried out efficiently. We build upon such results, introducing two new classes, including composite HOPs that allow us to flexibly combine tractable HOPs using simple logical switching rules. We present efficient message update algorithms for the new HOPs, and we improve upon the efficiency of message updates for a general class of existing HOPs. Importantly, we present both new and existing HOPs in a common representation; performing inference with any combination of these HOPs requires no change of representations or new derivations.

Appearing in Proceedings of the 13th International Conference on Artificial Intelligence and Statistics (AISTATS) 2010, Chia Laguna Resort, Sardinia, Italy. Volume 9 of JMLR: W&CP 9. Copyright 2010 by the authors.

1 INTRODUCTION

Probabilistic graphical models are powerful tools due to their representational power, and also due to general purpose algorithms that can be applied to any (low order) graphical model. For a broad range of problems, we can formulate a model in terms of graph structures and standard potentials. Then, without further derivation, we can automatically perform inference in the model. In particular, when the aim is to find a most likely configuration of variables (MAP), a range of efficient message passing algorithms can be applied (Wainwright et al., 2005; Werner, 2008; Globerson & Jaakkola, 2008).

When potentials begin to range over a large subset of variables, however, these methods quickly break down: for the general problem, message updates from a high order clique take time exponential in the size of the clique. One approach in such cases is to transform the problem into a pairwise problem by adding auxiliary variables. In the worst case, this will increase the problem size exponentially. Rother et al. (2009) give transformations that are practical for the special case of sparse potentials. Alternatively, special-purpose computations can sometimes be performed directly on the original high order representation, typically a factor graph (Givoni & Frey, 2009), which is the approach we take in this paper. This strategy applies beyond sparse potentials and is applicable for a wide range of special potential structures. Unfortunately, for a given potential it is typically not immediately clear whether it has tractable structure. Even if it does, message updates are case-specific and must be specially derived.

Our goal is to be able to use a broad range of high order potentials (HOPs) generically within MAP message passing algorithms as easily as we use low order tabular potentials. We see three issues that must be addressed:

1. Message updates need to be computed efficiently even when factors contain very large scopes.

2. It should be easy to recognize when problems contain tractable high order structure.

3. HOP constructions should be flexible and reusable, not requiring new problem-specific derivations and implementations on each use.

In Section 3, we describe the classes of atomic, building block HOPs that we consider, cardinality-based and order-based. Cardinality-based potentials have been used in several related works, and efficient message computations exist for both the general case and several restricted classes. Our first contribution, however, is showing that the efficiency can be improved even beyond existing efficient computations (Potetz, 2007; Tarlow et al., 2008). Our algorithm computes all messages in O(N log N) time, a factor of log N better than existing approaches. For the novel class of order-based potentials, we present equally efficient algorithms. For the atomic potentials we consider, finding the optimal assignment takes O(N) or O(N log N) time. We show that our algorithms can compute all N outgoing messages in the same asymptotic time.

Next, analogous in spirit to disciplined convex programming, which allows many of the manipulations and transformations required to analyze and solve convex programs to be automated (Grant et al., 2006), we introduce two types of composite HOPs, which allow us to build more complex HOPs by applying composition rules based on maximization or simple logical switches. A complex HOP can be recognized to be tractable by decomposing it into atomic tractable units combined with allowed composition rules. Importantly, once expressed as a composition, the message updates of composite HOPs can automatically be computed from the message computations of the atomic HOPs.

Our final, more subtle contribution is the particular binary representation, message normalization scheme, and caching strategy for computing all outgoing messages from a factor at once. This "framework," which we refer to throughout, is not novel, but it is also not the standard. Each part has a purpose: the binary representation exposes more structure in potentials; the message normalization scheme yields simpler derivations; and the caching strategy leads to more efficient algorithms. It should be straightforward to apply this strategy to other structured potentials to build new atomic building block HOPs.

Section 4 shows that well-known and novel graph constructions are easily expressible using the vocabulary of potentials discussed, and that once a model is expressed in this framework, it can be used in a variety of MAP inference procedures. In Section 5 we present experimental results that illustrate the ease of constructing models of interest with this formulation.

2 REPRESENTATION

We work with a factor graph representation, which is a bipartite graph consisting of variable nodes, h = {h_1, ..., h_n}, and factor nodes. Let N(h_j) be the neighbors of variable h_j in the graph. Factors, or potentials, θ = {θ_1, ..., θ_n, θ_{n+1}, ..., θ_{n+k}}, define interactions over individual variables h_j and subsets of variables C = {c_1, ..., c_k}, c ⊆ {h_1, ..., h_n}, which are exactly the factor's neighbors in the factor graph. With slight abuse of notation, we use θ_j to represent node potentials over single variables, and θ_c to represent HOPs over subsets. θ_j : h_j ∈ {0, 1} → R and θ_c : h_c ∈ {0, 1}^{|c|} → R assign a real value to a variable or subset assignment, respectively. The potentials we present can range over any number of variables, so we use N to generically represent |c|.

A factor graph, then, defines a (log) likelihood that takes the form

L(h) = Σ_{j=1}^{n} θ_j(h_j) + Σ_{c∈C} θ_c(h_c).   (1)

The MAP inference problem, to find a setting of h that maximizes the likelihood, h^OPT = arg max_h L(h), is NP-hard for general loopy graphs, and we typically must resort to approximate optimization methods.

2.1 MAP MESSAGE PASSING

Max-product belief propagation (MPBP) is an iterative, local, message passing algorithm that can be used to find the MAP configuration of a probability distribution specified by a tree-structured graphical model. When working in log space, the algorithm is known as max-sum, and the updates involve sending messages from factors to variables,

m̃_{θ_c→h_j}(h_j) = max_{h_c \ {h_j}} [ θ_c(h_c) + Σ_{j'∈c | j'≠j} m̃_{h_{j'}→θ_c}(h_{j'}) ],

and from variables to factors, m̃_{h_j→θ_c}(h_j) = Σ_{c'∈N(h_j)\c} m̃_{θ_{c'}→h_j}(h_j). After a forward and backward pass sending messages to and from a root node, optimal assignments can be decoded from beliefs, h_j^OPT = arg max_{h_j} b(h_j), where b(h_j) = Σ_{c'∈N(h_j)} m̃_{θ_{c'}→h_j}(h_j). In tree-structured graphs, beliefs defined in this way give (up to a constant) max-marginals: Φ_{j;a} = max_{h | h_j = a} L(h). In loopy graphs, beliefs produce pseudo max-marginals, which do not account for the loopy graph structure but can be decoded to give approximate solutions, which have been shown to be useful in practice.

2.2 BINARY REPRESENTATION

Note that we have restricted our attention to binary variable problems. To represent multinomial variables, we apply a simple transformation, converting variables with L states to binary variables with a 1-of-L constraint ensuring that exactly one variable in the set is on.

Since all variables are binary, we normalize messages so that the entry for h_j = 0 is always 0. To enforce this constraint, we subtract m̃_{θ_c→h_j}(0) from both coordinates, giving us (0, m̃_{θ_c→h_j}(1) − m̃_{θ_c→h_j}(0)).
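To make these conventions concrete, here is a small brute-force sketch of ours (not code from the paper) of the factor-to-variable max-sum update combined with the scalar-difference normalization just described. It is exponential in the factor size, which is exactly the cost that the structured computations of Section 3 avoid:

```python
import itertools

def max_sum_message(theta, in_msgs, j):
    """Brute-force max-sum message from a tabular factor to binary variable j,
    returned as a scalar difference (the h_j = 0 entry normalized to 0).

    theta: dict mapping joint assignments (tuples of 0/1) to log-potential values.
    in_msgs: incoming scalar-difference messages, one per variable in the
             factor's scope (the entry for j itself is ignored).
    """
    n = len(in_msgs)
    best = {0: float("-inf"), 1: float("-inf")}
    for h in itertools.product((0, 1), repeat=n):
        # Scalar-difference messages contribute only for variables that are on.
        val = theta[h] + sum(m for i, m in enumerate(in_msgs) if i != j and h[i])
        best[h[j]] = max(best[h[j]], val)
    return best[1] - best[0]

# Example: a soft "all on" factor over 3 variables (0 if all on, -1 otherwise).
theta = {h: (0.0 if all(h) else -1.0) for h in itertools.product((0, 1), repeat=3)}
msg = max_sum_message(theta, [0.4, -0.2, 0.3], j=0)
```

Because this example factor rewards the all-on pattern, the returned difference for h_0 is positive, encoding a preference for h_0 = 1.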


We can then drop the argument for m̃_{θ_c→h_j}(1) and use m_{θ_c→h_j} to represent the scalar message difference. Similarly, we assume θ_j(0) = 0 for all node potentials by setting θ̃_j = θ̃_j(1) = θ_j(1) − θ_j(0).

We are then working with message differences, where a positive value indicates a variable's preference to be on, and a negative value indicates a variable's preference to be off. To recover a vector of properly scaled messages from message differences, we need the value of the optimal assignment relative to the potential of interest and the incoming messages,

M_j^{θ_c} = max_{h_c} [ θ_c(h_c) + Σ_{j'∈c | j'≠j} h_{j'} · m_{h_{j'}→θ_c} ].

Define M^{θ_c} to be the maximum value when no j is left out of the sum. The correctly scaled message vector for h_j is then

( M^{θ_c} − max(0, m_{θ_c→h_j}), M^{θ_c} + min(0, m_{θ_c→h_j}) ).

Properly scaled max-marginals for the star surrounding a factor can then be seen as beliefs: Φ_j^{θ_c} = M_j^{θ_c} + (0, θ̃_j). We work in the scalar message difference representation, but this shows that we can convert to and from a vector representation if needed.

If we have a special purpose procedure for computing max-marginals over multinomial variables, as in Duchi et al. (2007), we can convert to a binary representation at the max-marginal level as well. Starting with max-marginals for a multinomial variable, c, which can take on one of M values, j ∈ {1, ..., M}, define binary variables h_j = 1 ⟺ c = j. A max-marginal for variable h_j = 1 is exactly the max-marginal for c = j. For h_j = 0, we can take the maximum max-marginal for c = j' | j' ≠ j.

3 TRACTABLE HOPS

We now turn our attention to classes of HOP where max-marginals can be computed efficiently. The message computations require careful caching and bookkeeping; however, the efficient implementations as well as detailed derivations of the updates will be made publicly available at http://www.cs.toronto.edu/~dtarlow/hops/.

3.1 CARDINALITY-BASED POTENTIALS

A broad class of existing tractable HOPs is where function values are based on the number of variables in the subset that are on:

θ_card(h_1, ..., h_N) = f(Σ_j h_j).

Gupta et al. (2007) show that the optimal single assignment can be computed for a star-structured model (i.e., one containing a single HOP and arbitrary unary potentials) over cardinality in O(N log N) time by a simple sorting then greedy procedure. We could compute max-marginals for the star graph naively in O(N^2 log N) time by running this procedure iteratively, fixing at each iteration one value of one variable. If we use MPBP, Potetz (2007) shows that a single message can be approximately computed in O(N) time, so approximate max-marginals could be computed in O(N^2) time. If we compute all outgoing messages at the same time, Tarlow et al. (2008) show that all N outgoing messages can be computed exactly in O(N^2) time.

Here, we give a procedure for exactly computing all outgoing messages in O(N log N), the same time it would take to compute the optimal assignment using the Gupta et al. (2007) procedure. We note that by using a dynamic heap and reusing previous sorts, the complexity may in reality be closer to O(N) total (O(1) amortized per message). Also note that this is in comparison to the O(N · 2^{N−1}) time that it would take to compute these messages if the potential were represented in standard tabular form.

First, as usual, sort all incoming messages in descending order, yielding m_{h_{j*_b}→θ}, where j*_b is the index of the incoming message with the bth largest value (and define r(j) as the rank of variable j, so that r(j*_b) = b). Next, in a linear pass, compute the cumulative sums

c_{−1}(b) = [Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ}] + f(b − 1);
c_0(b) = [Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ}] + f(b);
c_1(b) = [Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ}] + f(b + 1),

where Σ_{b'=1}^{b} m_{h_{j*_{b'}}→θ} is the sum of the b largest incoming messages (0 when b = 0); c_0 is defined for b ∈ {0, ..., N}, c_{−1} for b ≥ 1, and c_1 for b ≤ N − 1. In another linear pass, compute the cumulative maxes of the cumulative sums from the left, the right, or both:

s_1^L(b) = max_{b'∈{0,...,b}} c_1(b');
s_{−1}^R(b) = max_{b'∈{b,...,N}} c_{−1}(b');
s_0^L(b) = max_{b'∈{0,...,b}} c_0(b');
s_0^R(b) = max_{b'∈{b,...,N}} c_0(b').

The outgoing messages are then m_{θ→h_j} = m̃_{θ→h_j}(1) − m̃_{θ→h_j}(0), where

m̃_{θ→h_j}(0) = max( s_0^L(r(j) − 1), s_{−1}^R(r(j) + 1) − m_{h_j→θ} ),

m̃_{θ→h_j}(1) = max( s_1^L(r(j) − 1), s_0^R(r(j) + 1) − m_{h_j→θ} ).

Intuitively, this is simply a dynamic programming procedure for computing maximizations over cumulative sums, which can be done in a few linear passes. Messages are then array lookups, so the complexity of computing all N outgoing messages is dominated by the initial sort operation, which will be O(N log N) in the worst case.
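The whole procedure (sort, prefix sums, cumulative maxes, then constant-time lookups per message) can be sketched as follows. This is our own illustration, with prefix sums S(b) of the sorted messages playing the role of the cumulative sums above and 0-indexed ranks:

```python
import numpy as np

def cardinality_messages(f, m_in):
    """All N outgoing scalar-difference messages from theta(h) = f(sum_j h_j),
    given incoming scalar-difference messages m_in: one sort, then a few
    linear passes of cumulative sums and cumulative maxes."""
    m_in = np.asarray(m_in, dtype=float)
    N = len(m_in)
    order = np.argsort(-m_in)            # variables sorted by message, descending
    rank = np.empty(N, dtype=int)
    rank[order] = np.arange(N)           # rank[j] = 0-indexed sorted position of j
    S = np.concatenate(([0.0], np.cumsum(m_in[order])))  # S[b] = sum of b largest
    fk = np.array([f(k) for k in range(N + 1)], dtype=float)
    A = S + fk                           # A[b] = S(b) + f(b)    (c_0)
    B = S[1:] + fk[:-1]                  # B[b] = S(b+1) + f(b)  (c_-1, shifted)
    C = S[:-1] + fk[1:]                  # C[b] = S(b) + f(b+1)  (c_1)
    LA = np.maximum.accumulate(A)                  # prefix maxes
    LC = np.maximum.accumulate(C)
    RB = np.maximum.accumulate(B[::-1])[::-1]      # suffix maxes
    RA = np.maximum.accumulate(A[::-1])[::-1]
    out = np.empty(N)
    for j in range(N):
        r = rank[j]
        # Counts where j sits inside the top block: remove its own message.
        m0 = RB[r] - m_in[j]
        m1 = RA[r + 1] - m_in[j]
        # Counts where j sits outside the top block: plain prefix maxes.
        if r > 0:
            m0 = max(m0, LA[r - 1])
            m1 = max(m1, LC[r - 1])
        out[j] = m1 - m0
    return out

# Example: a soft 1-of-N potential, f(k) = 0 if exactly one variable is on, -1 otherwise.
msgs = cardinality_messages(lambda k: 0.0 if k == 1 else -1.0, [0.5, -0.5])
```

On this example the two messages come out as 0.5 and −0.5: the factor pushes the stronger variable on and the weaker one off. For N = 1 the result reduces to f(1) − f(0), independent of the incoming message.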


3.1.1 Special Cardinality Potentials

A simple special case of cardinality potentials is the potential that takes on value 0 if a set of variables are all on (i.e., Σ_j h_j = N) and a value of −α otherwise. When α > 0, this encourages sets of variables to match specific patterns, so it has been referred to as a pattern-based potential (Komodakis & Paragios, 2009). By breaking the computation into separate cases for Σ_j h_j = N and Σ_j h_j ≠ N, it is straightforward to show that the messages take the form

m_{θ_on→h_j} = max( α + Σ_{j':j'≠j} min(0, m_{h_{j'}→θ_on}), 0 ).

Due to this more restricted structure, all outgoing messages from a pattern-based potential can be computed in O(N) (O(1) amortized) time by computing the full sum, Σ_{j'} min(0, m_{h_{j'}→θ_on}), and then subtracting min(0, m_{h_j→θ_on}) for each individual message.

Another special case of cardinality potentials is the potential that constrains exactly b variables in a set to be on. Givoni and Frey (2009) give updates that compute all messages from a single such potential in O(N) time, showing that additional structure allows us to compute messages more efficiently than in the general case.

3.2 ORDER-BASED POTENTIALS

We now introduce examples from a new class of tractable HOP. These potentials depend on the ordering of variables within a factor, and they arise when trying to represent spatial, temporal, or other ordering considerations.

3.2.1 Convex Set Potentials

A convex set in one dimension is a contiguous subset. We define a convex set potential, θ_cvx, to be a potential that requires the variables with value 1 in an ordered set to form a convex set. Before deriving message updates, we must develop a notion of maximum weight contiguous sequences and an algorithm for efficiently finding them.

Suppose we have a vector of real-valued weights w = (w_1, ..., w_N). We define the maximum weight contiguous sequence problem as max_h w · h such that the h_j ∈ h with value 1 form a convex set. Define MS(range, constraint) as the value of the maximum weight contiguous subsequence in the given range obeying the constraint (or unconstrained if the constraint is left out). For notational simplicity, if MS(range, h_j = 1) < 0, let the value be 0 instead. The scalar message m_{θ_cvx→h_j} is

MS(h_{1:j−1}, h_{j−1} = 1) + MS(h_{j+1:N}, h_{j+1} = 1) − max( MS(h_{1:j−1}), MS(h_{j+1:N}) ).

We can compute the constrained and unconstrained maximum weight contiguous subsequences we need for all outgoing messages in linear time using dynamic programming. Given this, computing a message requires only four array lookups, so the total time to compute all messages is O(N) (O(1) amortized).

3.2.2 Before-After Potentials

We define a before-after potential over two ordered subsets of variables that encourages one to come before the other:

θ_→(h_x, h_y) =
  −∞ if (Σ_{i∈x} h_i) ≠ 1 ∨ (Σ_{j∈y} h_j) ≠ 1,
  0 if i < j | i ∈ x, j ∈ y, h_i = h_j = 1,
  −α otherwise.

The key to computing messages for this factor is to condition on the hard constraint being satisfied, then split into the cases i > j and i < j. We compute maximums independently, then take the maximum over cases, where α is subtracted from the case i > j.

As with the cardinality potentials, we take cumulative maxes over the incoming x messages and incoming y messages from both the left and the right, respectively: s_x^L(k), s_x^R(k), s_y^L(k), s_y^R(k) for k ∈ {1, ..., N}. Ignoring edge cases and only dealing with messages to h_i | i ∈ x for simplicity, the basic form of the messages is

m̃_{θ_→→h_i}(1) = max( s_y^L(i − 1) − α, s_y^R(i + 1) ),

and m̃_{θ_→→h_i}(0) = M^{θ_→} for all cases except messages to h_{i*}, where

M^{θ_→} = max( max_k [ s_x^L(k) + s_y^R(k + 1) ], max_k [ s_y^L(k) + s_x^R(k + 1) ] − α )

for k ∈ {1, ..., N − 1}, and i* is the choice of i used in the optimal assignment of i and j. For messages to h_{i*}, we need the maximum setting of i and j where i ≠ i*; this can easily be computed in another linear pass. As usual, m_{θ_→→h_i} = m̃_{θ_→→h_i}(1) − m̃_{θ_→→h_i}(0). Updates to h_j ∈ y are almost identical but reversed.

This computation leverages the same caching ideas as the general cardinality computations, which should be broadly applicable across other classes of HOP as well. Again, after the proper linear processing, the messages we need to send are just array lookups, so the total complexity is O(N) to compute all messages, or O(1) per message amortized.

3.3 COMPOSITION OF POTENTIALS

In many cases, we may wish to switch between HOPs based on some simple logical rules. Typically, this would require deriving new message updates, even for small changes. To address this problem, we present the class of composite potentials, which can be viewed as an extension of context-specific independencies, where rules specify tractable HOPs instead of constant values (Boutilier et al., 1996).
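The MS quantities used by the convex set messages above can be computed with Kadane-style prefix and suffix arrays. The following is our own linear-time sketch, assuming the hard contiguity potential permits the empty set:

```python
def convex_set_messages(m_in):
    """All outgoing scalar-difference messages from a hard convex-set
    (contiguity) potential, given incoming scalar-difference messages,
    via linear-time Kadane-style prefix/suffix dynamic programming."""
    N = len(m_in)
    NEG = float("-inf")
    # end_at[p] / start_at[p]: best weight of a contiguous run that ends /
    # starts exactly at position p (extension beyond p is optional).
    end_at, start_at = [NEG] * N, [NEG] * N
    for p in range(N):
        end_at[p] = m_in[p] + max(0.0, end_at[p - 1] if p > 0 else 0.0)
    for p in range(N - 1, -1, -1):
        start_at[p] = m_in[p] + max(0.0, start_at[p + 1] if p < N - 1 else 0.0)
    # best_left[p]: best contiguous run (possibly empty, value 0) within 0..p-1;
    # best_right[p]: the same within p..N-1.
    best_left, best_right = [0.0] * (N + 1), [0.0] * (N + 1)
    for p in range(N):
        best_left[p + 1] = max(best_left[p], end_at[p])
    for p in range(N - 1, -1, -1):
        best_right[p] = max(best_right[p + 1], start_at[p])
    out = []
    for j in range(N):
        left = max(0.0, end_at[j - 1]) if j > 0 else 0.0        # optional run ending at j-1
        right = max(0.0, start_at[j + 1]) if j < N - 1 else 0.0  # optional run starting at j+1
        m1 = left + right                        # best run through j, excluding m_in[j]
        m0 = max(best_left[j], best_right[j + 1])  # best run avoiding j (or empty)
        out.append(m1 - m0)
    return out
```

The clamping at 0 in `left` and `right` mirrors the convention above that a constrained MS value below 0 is replaced by 0: extending the run past j in either direction is optional.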


Suppose we have two disjoint subsets of h_c, h_s ∪ h_p = h_c and h_s ∩ h_p = ∅. Define a composite potential

θ_φ(h_c) = θ_{g(h_s)}(h_p),

where g : {0, 1}^{|h_s|} → {0, ..., K − 1} assigns each setting of h_s to one of K partitions, and there is a different active potential θ_{g(h_s)} depending on the partition. If we condition on g(h_s) = k, the maximizations we need decouple into

max_{h_s | g(h_s)=k} [ Σ_{i'} m̃_{h_{i'}→θ_φ}(h_{i'}) ] + max_{h_p \ h_j} [ θ_k(h_p) + Σ_{j'≠j} m̃_{h_{j'}→θ_φ}(h_{j'}) ].

In most practical cases, |h_s| is small (and as it grows, the potential becomes an intractable non-structured HOP), so we do the first maximization by enumerating all possible values of h_s, yielding M^{θ_s}(k), the maximum of the first term conditioned on g(h_s) = k.

We use efficient message computations to compute properly scaled max-marginals for the subset of variables in h_p relative to θ_k; this is just computing max-marginals for the star graph around θ_k, ignoring h_s. We know how to do this efficiently because θ_k is assumed to be a tractable HOP. We define these values to be M_{j;h_j}^{θ_k}(k) for potential k and h_j fixed to take on value 0 or 1. Finally, compute

m̃_{θ_φ→h_j}(0) = max_k [ M^{θ_s}(k) + M_{j;0}^{θ_k}(k) ],
m̃_{θ_φ→h_j}(1) = max_k [ M^{θ_s}(k) + M_{j;1}^{θ_k}(k) ],

by evaluating all combinations of k ∈ {0, ..., K − 1} and h_j ∈ {0, 1}, which requires computing all outgoing messages from K tractable HOPs, yielding an amortized message cost of O(2^{|h_s|} + K) or O(2^{|h_s|} + K log N) depending on the type of HOP. Note that the optimal value of k may be different for each maximization. The messages to variables in h_s are similar, but we omit them due to space.

A related composite factor is one where there is no preference for h_s to take any particular value (i.e., for all h ∈ h_s, h is only in the scope of one composite factor). In this case, the HOPs are switched between based only on which is most likely, and since messages to h_s would only be useful to infer which potential was chosen, we can choose not to compute them if we are only interested in the assignment of the h_p variables. This is a max-composition potential, and the truncated linear deviation pattern potentials of Rother et al. (2009) can be seen as a special case of this composition.

The binary formulation of affinity propagation (Givoni & Frey, 2009), for example, can be thought of as using a composite potential for each column j, where h_s = {h_jj}, h_p = {h_ij | i ≠ j}, g(0) = 0, g(1) = 1, θ_0 is an all-off potential, and θ_1 = 0.

4 USING HOPS

Graph Constructions

There are many existing and novel models that can be constructed by combining HOPs in different ways. Here we give a few examples of the range of models that can be constructed using the HOP vocabulary.

Pattern potentials are becoming popular in several computer vision applications. b-of-N potentials can be used: in general b-matching problems; to represent the set cover problem in conjunction with a composite potential that enforces an all-on or all-off constraint; and as hard degree constraints on nodes in a structure learning framework.

Cardinality potentials are useful in many cases beyond b-of-N potentials or pattern potentials, though. Hard or soft parity potentials can be expressed by setting f(k) = 0 when k is even and f(k) = −α when k is odd. They can further be used to express nonparametric priors over cluster sizes in an exemplar-based clustering setting; they can encourage a specific percentage of pixels to be on in an image segmentation setting; and they can be used to represent a learned distribution of sizes for empirical image segmentation priors.

Before-after potentials are useful for representing user preferences of document i over document j in a ranking setting; temporal ordering information in a time-warped matching setting; soft word ordering preferences in a language parsing setting; or the soft temporal ordering of subactivities in an activity recognition setting.

Convex set potentials are useful for representing constraints that parts be convex in an image segmentation setting; that words be contiguous in a language parsing setting; or that objects appear in a contiguous region of time in a recognition-based tracking setting.

Importantly, because the underlying mechanism is message passing on standard factor graphs, any combination of these potentials or composed variations can readily be combined and seamlessly integrated with standard local potentials.

Outer Loop Algorithms

Our contribution in this work lies in the inner loop of inference: in the subroutines used to calculate messages in the message passing framework. We emphasize that several "outer loop" inference routines can take advantage of these message-computation subroutines. Most obviously, MPBP defined on a cluster graph with single variable separator sets can use the presented message computations


without modification, though schedules that compute all outgoing messages from a high order factor at once will be most efficient. Globerson and Jaakkola (2008) give a generalized edge-based variant of max-product linear programming (GEMPLP) that can use standard HOP computations but requires that (easily calculated) backwards messages be added to the output. Komodakis and Paragios (2009) mention a high order variant of tree-reweighted max-product (TRW) (Wainwright et al., 2005) that can use these computations to directly compute the needed star graph max-marginals.

5 EXPERIMENTS

We have explored HOP constructions for several problems in combinatorial optimization, ranking, and vision. However, we leave a thorough exploration of these applications and comparison of outer-loop MAP algorithms to future work. Here, we focus on an image segmentation and a ranking example, demonstrating that high order models can be constructed easily and used within a variety of outer loop algorithms.

Image Segmentation

We generated 64 x 64 pixel synthetic data for a foreground-background image segmentation task that simulates a fairly strong foreground response in a square region of the image, then a semi-occluded region (horizontal bar) where there is little distinction from the background. We added uniform pairwise potentials to the 4-neighborhood of each pixel, and we experimented with different combinations of HOPs: no extra potentials; convex set potentials along horizontal, vertical, 45°, and 135° diagonals; cardinality potentials f(k) = −|64²/4 − k|; and a combination of both cardinality and convex set potentials.

We run asynchronous belief propagation, using a typical grid schedule, where we pass messages down then up along columns, to neighboring columns, then moving across the image and repeating for the next column. We follow this with a similar scheme for rows. After one round of these messages, we iterate through all HOPs and update all messages going into the current high order factor, then all messages outgoing from the HOP. We use damping of .5, and run until messages converge.

Figure 1: Image segmentation results for a 64 x 64 square using asynchronous max-product belief propagation and combinations of high order cardinality and convexity potentials. Rows from top to bottom: only unary and pairwise; unary, pairwise, and convexity; unary, pairwise, and cardinality; unary, pairwise, convexity, and cardinality. Columns from left to right: pairwise strength 0, .05, .1, .5.

Fig. 1 shows decoded beliefs for a variety of pairwise potential strengths and HOP combinations. Note that there is no setting of pairwise potentials that identifies the region as one square (top row). However, when we add the HOPs, it becomes easy to express priors that encourage the single square interpretation. Additionally, the algorithm is reasoning in a non-trivial way, especially in the cardinality case as pairwise potentials get stronger (third row). When pairwise potentials are weak (left columns), the algorithm chooses the pixels with highest singleton potential to turn on. As the pairwise potentials get stronger (right columns), it must balance singleton potentials with spatial contiguity.

Rank Aggregation

When ranking documents for information retrieval, it is common to use a set of document- and query-specific features to learn a score for each document-query combination. When eliciting preferences from users, however, it is difficult for them to produce a real-valued score for a query-document pair. Instead, users typically can express pairwise preferences (i.e., they prefer document i to document j for some query).

We simulate this setting by building a query-specific model where N documents can take on one of N ranks, represented as an N x N binary grid of variables h_ij. Each document has a score, s_i ∼ Uniform(0, 1), and we use unary potentials of the form θ_ij(h_ij) = (N − j) · s_i to express that we prefer documents with higher scores to be ranked higher (closer to 1). We simulate user preferences by adding before-after potentials of random strength over 10% of randomly chosen pairs of rows. We also scale the overall strength of all before-after potentials by multiplying by a common factor, λ,


giving us θ_→(h_i:, h_k:; λ). This objective incorporates both document-specific scores and elicited user preferences. Note that if all document scores are zero, this can represent the NP-hard Kemeny rank aggregation problem (Bartholdi et al., 1989).

In addition, we add the full bipartite matching potential described in Duchi et al. (2007), enforcing the constraint that each document can only take on one rank, and each rank can only be occupied by one document. We do inference with the simple high-order generalization of TRW given in Komodakis and Paragios (2009). To decode assignments, we employ a greedy strategy that is guaranteed to enforce the uniqueness constraints: find the largest belief, b_ij, set the corresponding h_ij = 1, then set beliefs in row i and column j to −∞ and repeat until we have a full assignment. For comparison, we also compute the objective of an algorithm that ranks based solely on score. The dual objective always met the primal decoded objective, so we stopped inference when we were guaranteed to have found the global optimum. Fig. 2 shows results for several λ.

[Figure 2 panels: (a)-(c) show objective versus iteration curves for "Scores Only", "Dual Objective", and "Primal Decoding"; (d)-(f) show final assignments for λ = 0.01, 0.05, 0.25 with ρ = 13/18, 15/18, 18/18, respectively.]

Figure 2: (a - c) Primal and dual objective for ranking using high order TRW. λ gives the strength of before-after potentials representing binary user preferences of one document over another. (d - f) Final assignments found when λ is varied, all of which are globally optimal. Documents are rows, ranks are columns. Document-specific scores encourage document i to be ranked in position i. ρ gives the fraction of soft pairwise order constraints that are satisfied.

6 RELATED WORK

High Order Message Passing

In this work, we put into a single framework, and expanded upon, several recent works. See Fig. 3 for an overview. Givoni and Frey (2009) use efficient updates for various b-of-N potentials in a MPBP setting; Huang and Jebara (2007) develop an efficient MPBP algorithm for weighted bipartite b-matching and prove that the algorithm finds the optimal assignment; Duchi et al. (2007) show how combinatorial algorithms can efficiently compute messages in models that have a full bipartite matching or associative MRF as a subproblem; and Komodakis and Paragios (2009) show how to compute max-marginals for pattern-based potentials. Potetz (2007) and Tarlow et al. (2008) show how to compute messages for the general cardinality case, and Werner (2008) briefly sketches a simple algorithm for doing so, but none gives the more efficient algorithm we have presented here.

Other Decompositions

Dual decomposition (DD) (Bertsekas, 1999) is a framework for combining optimal subproblem solutions rather than max-marginals. Our potentials could easily be used as subproblems in this framework, but it would be simpler to directly compute subproblem assignments rather than doing so via max-marginals. DD has become popular very recently as an alternative to belief propagation, but few direct comparisons have been done on equivalent problem decompositions. Our formulation provides a means to perform a comparison of DD and message passing approaches, because it allows message passing algorithms to be defined on equivalent problem decompositions for many of the problems that DD has been applied to.

Finally, certain restricted types of HOP have also been used in a graph cuts framework (Kohli et al., 2007, 2009; Delong & Boykov, 2009). When a problem can be formulated to be submodular, these approaches will always find the global optimum. As models become more complex, however, it takes more effort to manipulate the model into a compliant form, often involving transformations, creation of auxiliary variables, and sometimes non-intuitive restrictions on the problem formulation. Our work, in contrast, applies uniformly to a broader range of problems, and it is intuitive to construct a model using the message passing formulation. However, we see discovering further connections and developing hybrid algorithms that work with ideas from both approaches as an open direction of work.

7 CONCLUSION

In many low order models where we can compute global optimums, the assignments still do not match ground truth. This indicates we need richer, possibly higher order representations. A challenge in working with HOPs is that it is often difficult to recognize when


a problem formulation will be tractable. Our vocabulary of common HOPs should make it easier to identify problem structures and formulate appropriate potentials that can be incorporated into efficient message passing MAP inference algorithms.

We emphasize that the HOPs and composition rules that we have presented are not exhaustive. There are likely to be many other classes of tractable HOPs and compositions. In these cases, the strategies presented here can likely be applied to build new classes of HOPs.

A limitation of our work as presented here is that factors only communicate with the rest of the network via single-variable interfaces. To get communication, for example, between planar faces in a generalized MPBP setting, and to get tighter LP relaxations, we would like our factors to communicate via separator sets of several variables. In theory, there is nothing preventing us from deriving such algorithms; this is an interesting direction for future work.

Figure 3: Combinations of potentials and inference algorithms in the literature. [Table: columns are the potential classes Special Cardinality (b of N; >/< b of N), Pattern, General Cardinality, General Order, and Composition; rows are the inference methods MPBP, covering [C], [Giv], [H], and [T]**; MPBP Variants, covering [Kom] and [W]**; and DD (OPT only), covering [Tor], [Kom], [Vic], and [Gup].] OPT only means the method does not use max-marginals. * Can be expressed using the presented HOPs and, to our knowledge, is novel. ** We improve upon the efficiency of these calculations. [C] (Cheng et al., 2006), [Giv] (Givoni & Frey, 2009), [Gup] (Gupta et al., 2007), [H] (Huang & Jebara, 2007), [Kom] (Komodakis & Paragios, 2009), [T] (Tarlow et al., 2008), [Tor] (Torresani et al., 2008), [W] (Werner, 2008), [Vic] (Vicente et al., 2009).

Acknowledgements

We thank Pushmeet Kohli and Victor Lempitsky for motivating application ideas and Amit Gruber for comments on an earlier version of the manuscript.

References

Bartholdi, J., Tovey, C. A., & Trick, M. A. (1989). Voting schemes for which it can be difficult to tell who won the election. Social Choice and Welfare, 6(3).

Bertsekas, D. (1999). Nonlinear Programming.

Boutilier, C., Friedman, N., Goldszmidt, M., & Koller, D. (1996). Context-specific independence in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI) (pp. 115–123).

Cheng, Y.-S., Neely, M. J., & Chugg, K. M. (2006). Iterative message passing algorithm for bipartite maximum weighted matching. In Proc. of IEEE International Symposium on Information Theory.

Delong, A., & Boykov, Y. (2009). Globally optimal segmentation of multi-region objects. In International Conference on Computer Vision (ICCV).

Duchi, J., Tarlow, D., Elidan, G., & Koller, D. (2007). Using combinatorial optimization within max-product belief propagation. In Advances in Neural Information Processing Systems 19 (pp. 369–376). Cambridge, MA: MIT Press.

Givoni, I., & Frey, B. J. (2009). A binary variable model for affinity propagation. Neural Computation, 21(6), 1589–1600.

Globerson, A., & Jaakkola, T. (2008). Fixing max product: Convergent message passing algorithms for MAP LP-relaxations. In Neural Information Processing Systems.

Grant, M., Boyd, S., & Ye, Y. (2006). Global optimization: From theory to implementation (L. Liberti & N. Maculan, Eds.).

Gupta, R., Diwan, A., & Sarawagi, S. (2007). Efficient inference with cardinality-based clique potentials. In International Conference on Machine Learning (ICML) (Vol. 227, pp. 329–336).

Huang, B., & Jebara, T. (2007). Loopy belief propagation for bipartite maximum weight b-matching. In The Eleventh International Conference on Artificial Intelligence and Statistics.

Kohli, P., Kumar, M. P., & Torr, P. (2007). P3 and beyond: Solving energies with higher order cliques. In Computer Vision and Pattern Recognition (CVPR).

Kohli, P., Ladický, L., & Torr, P. H. (2009). Robust higher order potentials for enforcing label consistency. Int. J. Comput. Vision, 82(3), 302–324.

Komodakis, N., & Paragios, N. (2009). Beyond pairwise energies: Efficient optimization for higher-order MRFs. In Computer Vision and Pattern Recognition (CVPR).

Potetz, B. (2007). Efficient belief propagation for vision using linear constraint nodes. In Computer Vision and Pattern Recognition (CVPR). Minneapolis, MN, USA: IEEE Computer Society.

Rother, C., Kohli, P., Feng, W., & Jia, J. (2009). Minimizing sparse higher order energy functions of discrete variables. In Computer Vision and Pattern Recognition (CVPR).

Tarlow, D., Zemel, R., & Frey, B. (2008). Flexible priors for exemplar-based clustering. In Uncertainty in Artificial Intelligence.

Torresani, L., Kolmogorov, V., & Rother, C. (2008). Feature correspondence via graph matching: Models and global optimization. In European Conference on Computer Vision (ECCV).

Vicente, S., Kolmogorov, V., & Rother, C. (2009). Joint optimization of segmentation and appearance models. In International Conference on Computer Vision (ICCV).

Wainwright, M. J., Jaakkola, T., & Willsky, A. S. (2005). MAP estimation via agreement on trees: Message-passing and linear programming. IEEE Transactions on Information Theory, 51(11), 3697–3717.

Werner, T. (2008). High-arity interactions, polyhedral relaxations, and cutting plane algorithm for soft constraint optimisation (MAP-MRF). In Computer Vision and Pattern Recognition (CVPR).
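For concreteness, the greedy uniqueness-enforcing decoding strategy described in the ranking experiments can be sketched in a few lines. This is an illustrative sketch only; the function name, the use of NumPy, and the assumption of a square belief matrix are ours, not from the paper:

```python
import numpy as np

def greedy_decode(beliefs):
    """Greedily decode a bipartite assignment from beliefs b_ij:
    repeatedly pick the largest belief, set the corresponding
    h_ij = 1, then set beliefs in row i and column j to -inf so
    each document gets one rank and each rank one document."""
    b = np.array(beliefs, dtype=float)
    n = b.shape[0]  # assumes an n x n (documents x ranks) matrix
    assignment = np.zeros_like(b, dtype=int)
    for _ in range(n):
        # Largest remaining belief and its (row, column) position.
        i, j = np.unravel_index(np.argmax(b), b.shape)
        assignment[i, j] = 1
        b[i, :] = -np.inf  # document i is now ranked
        b[:, j] = -np.inf  # rank j is now occupied
    return assignment
```

By construction the result satisfies both uniqueness constraints (exactly one 1 per row and per column), which is why the decoded primal objective can be compared directly against the dual objective as a stopping criterion.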
