An Augmented Lagrangian Approach to Constrained MAP Inference
André F. T. Martins†‡ [email protected]
Mário A. T. Figueiredo‡ [email protected]
Pedro M. Q. Aguiar♯ [email protected]
Noah A. Smith† [email protected]
Eric P. Xing† [email protected]
†School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
‡Instituto de Telecomunicações / ♯Instituto de Sistemas e Robótica, Instituto Superior Técnico, Lisboa, Portugal

Appearing in Proceedings of the 28th International Conference on Machine Learning, Bellevue, WA, USA, 2011.

Abstract

We propose a new algorithm for approximate MAP inference on factor graphs, by combining augmented Lagrangian optimization with the dual decomposition method. Each slave subproblem is given a quadratic penalty, which pushes toward faster consensus than in previous subgradient approaches. Our algorithm is provably convergent, parallelizable, and suitable for fine decompositions of the graph. We show how it can efficiently handle problems with (possibly global) structural constraints via simple sort operations. Experiments on synthetic and real-world data show that our approach compares favorably with the state-of-the-art.

1. Introduction

Graphical models enable compact representations of probability distributions, being widely used in computer vision, natural language processing (NLP), and computational biology (Koller & Friedman, 2009). A prevalent problem is that of inferring the most probable configuration, the so-called maximum a posteriori (MAP). Unfortunately, this problem is intractable, except for a limited class of models. This fact precludes computing the MAP exactly in many important models involving non-local features or requiring structural constraints to ensure valid predictions.

A significant body of research has thus been placed on approximate MAP inference, e.g., via linear programming relaxations (Schlesinger, 1976). Several message-passing algorithms have been proposed that exploit the graph structure in these relaxations (Wainwright et al., 2005; Kolmogorov, 2006; Werner, 2007; Globerson & Jaakkola, 2008; Ravikumar et al., 2010). In the same line, Komodakis et al. (2007) proposed a method based on the classical dual decomposition technique (DD; Dantzig & Wolfe 1960; Everett III 1963; Shor 1985), which breaks the original problem into a set of smaller (slave) subproblems, splits the shared variables, and tackles the Lagrange dual with the subgradient algorithm. Initially applied in computer vision, DD has also been shown effective in NLP (Koo et al., 2010). The drawback is that the subgradient algorithm is very slow to converge when the number of slaves is large. This led Jojic et al. (2010) to propose an accelerated gradient method obtained by smoothing the objective.

In this paper, we ally the simplicity of DD with the effectiveness of augmented Lagrangian methods, which have a long-standing history in optimization (Hestenes, 1969; Powell, 1969; Glowinski & Marroco, 1975; Gabay & Mercier, 1976; Boyd et al., 2011). The result is a novel algorithm for approximate MAP inference: DD-ADMM (Dual Decomposition with the Alternating Direction Method of Multipliers). Rather than placing all efforts in attempting progress in the dual, DD-ADMM looks for a saddle point of the Lagrangian function, which is augmented with a quadratic term to penalize slave disagreements. Key features of DD-ADMM are:

• It is suitable for heavy parallelization (many slaves);
• it is provably convergent, even when each slave subproblem is only solved approximately;
• consensus among slaves is fast, by virtue of the quadratic penalty term, hence it exhibits faster convergence in the primal than competing methods;
• in addition to providing an optimality certificate for
the exact MAP, it also provides guarantees that the LP-relaxed solution has been found.

Copyright 2011 by the author(s)/owner(s).

After providing the necessary background (Sect. 2) and introducing and analyzing DD-ADMM (Sect. 3), we turn to the slave subproblems (Sect. 4). Of particular concern to us are problems with structural constraints, which arise commonly in NLP, vision, and other structured prediction tasks. We show that, for several important constraints, each slave can be solved exactly and efficiently via sort operations. Experiments with pairwise MRFs and dependency parsing (Sect. 5) testify to the success of our approach.

2. Background

2.1. Problem Formulation

Let X ≜ (X_1, …, X_N) ∈ 𝒳 be a vector of discrete random variables, where each X_i ∈ 𝒳_i, with 𝒳_i a finite set. We assume that X has a Gibbs distribution associated with a factor graph G (Kschischang et al., 2001), composed of a set of variable nodes {1, …, N} and a set of factor nodes A, with each a ∈ A linked to a subset of variables N(a) ⊆ {1, …, N}:

    P_{θ,φ}(x) ∝ exp( Σ_{i=1}^{N} θ_i(x_i) + Σ_{a∈A} φ_a(x_a) ).    (1)

Above, x_a stands for the subvector indexed by the elements of N(a), and θ_i(·) and φ_a(·) are, respectively, unary and higher-order log-potential functions. To accommodate hard constraints, we allow these functions to take values in ℝ ∪ {−∞}. For simplicity, we write θ_i ≜ (θ_i(x_i))_{x_i∈𝒳_i} and φ_a ≜ (φ_a(x_a))_{x_a∈𝒳_a}.

We are interested in the task of finding the most probable assignment (the MAP), x̂ ≜ arg max_{x∈𝒳} P_{θ,φ}(x). This (in general NP-hard) combinatorial problem can be transformed into a linear program (LP) by introducing marginal variables µ ≜ (µ_i)_{i=1}^{N} and ν ≜ (ν_a)_{a∈A}, constrained to the marginal polytope of G, i.e., the set of realizable marginals (Wainwright & Jordan, 2008). Denoting this set by M(G), this yields

    OPT ≜ max_{(µ,ν)∈M(G)} Σ_i θ_i^⊤ µ_i + Σ_a φ_a^⊤ ν_a,    (2)

which always admits an integer solution. Unfortunately, M(G) often lacks a concise representation, which renders (2) intractable. A common workaround is to replace M(G) by the outer bound L(G) ⊇ M(G), the so-called local polytope, defined as

    L(G) = { (µ, ν) | 1^⊤ µ_i = 1, ∀i;  H_{ia} ν_a = µ_i, ∀a, i ∈ N(a);  ν_a ≥ 0, ∀a },    (3)

where H_{ia}(x_i, x_a) = 1 if [x_a]_i = x_i, and 0 otherwise. This yields the following LP relaxation of (2):

    OPT′ ≜ max_{(µ,ν)∈L(G)} Σ_i θ_i^⊤ µ_i + Σ_a φ_a^⊤ ν_a,    (4)

which will be our main focus throughout. Obviously, OPT′ ≥ OPT, since L(G) ⊇ M(G).

2.2. Dual Decomposition

Several message passing algorithms (Wainwright et al., 2005; Kolmogorov, 2006; Globerson & Jaakkola, 2008) are derived via some reformulation of (4) followed by dualization. The DD method (Komodakis et al., 2007) reformulates (4) by adding new variables ν_i^a (for each factor a and i ∈ N(a)) that are local "replicas" of the marginals µ_i. Letting N(i) ≜ {a | i ∈ N(a)} and d_i = |N(i)| (the degree of node i), (4) is rewritten as

    max_{ν,µ} Σ_a ( Σ_{i∈N(a)} d_i^{−1} θ_i^⊤ ν_i^a + φ_a^⊤ ν_a )    (5)
    s.t. (ν_{N(a)}^a, ν_a) ∈ M(G_a), ∀a,
         ν_i^a = µ_i, ∀a, i ∈ N(a),

where G_a is the subgraph of G comprised only of factor a and the variables in N(a), M(G_a) is the corresponding marginal polytope, and we denote ν_{N(a)}^a ≜ (ν_i^a)_{i∈N(a)}. (Note that by definition, L(G) = { (µ, ν) | (µ_{N(a)}, ν_a) ∈ M(G_a), ∀a ∈ A }.)

Problem (5) would be completely separable (over the factors) if it were not for the "coupling" constraints ν_i^a = µ_i. Introducing Lagrange multipliers λ_i^a for these constraints, the dual problem (the master) becomes

    min_λ L(λ) ≜ Σ_a s_a( (d_i^{−1} θ_i + λ_i^a)_{i∈N(a)}, φ_a )    (6)
    s.t. λ ∈ Λ ≜ { λ | Σ_{a∈N(i)} λ_i^a = 0, ∀i },

where each s_a corresponds to a slave subproblem

    s_a(ω_{N(a)}^a, φ_a) ≜ max_{(ν_{N(a)}^a, ν_a) ∈ M(G_a)} Σ_{i∈N(a)} (ω_i^a)^⊤ ν_i^a + φ_a^⊤ ν_a.    (7)

Note that the slaves (7) are MAP problems of the same kind as (2), but local to each factor a. Denote by (ν̂_{N(a)}^a, ν̂_a) = map(ω_{N(a)}^a, φ_a) the maximizer of (7).

The master problem (6) can be addressed elegantly with a projected subgradient algorithm: note that a subgradient ∇_{λ_i^a} L(λ) is readily available upon solving the a-th slave, via ∇_{λ_i^a} L(λ) = ν̂_i^a. These slaves can be handled in parallel and then have their solutions gathered for computing a projection onto Λ, which is simply a centering operation. This results in Alg. 1.

Algorithm 1 DD-Subgradient
1: input: factor graph G, parameters θ, φ, number of iterations T, stepsize sequence (η_t)_{t=1}^{T}
2: Initialize λ = 0
3: for t = 1 to T do
4:   for each factor a ∈ A do
5:     Set ω_i^a = d_i^{−1} θ_i + λ_i^a, for i ∈ N(a)
6:     Compute (ν̂_{N(a)}^a, ν̂_a) = map(ω_{N(a)}^a, φ_a)
7:   end for
8:   Compute average µ_i = d_i^{−1} Σ_{a: i∈N(a)} ν̂_i^a
9:   Update λ_i^a ← λ_i^a − η_t (ν̂_i^a − µ_i)
10: end for
11: output: λ

Alg. 1 inherits the properties of subgradient algorithms, hence it converges to the optimal value OPT′ of (4) if the stepsize sequence (η_t) is diminishing and nonsummable (Bertsekas et al., 1999). In practice, convergence can be quite slow if the number of slaves is large. This is because it may be hard to reach a consensus on variables with many replicas.

Algorithm 2 DD-ADMM
1: input: factor graph G, parameters θ, φ, number of iterations T, sequence (η_t)_{t=1}^{T}, parameter τ
2: Initialize µ uniformly, λ = 0
3: for t = 1 to T do
4:   for each factor a ∈ A do
5:     Set ω_i^a = d_i^{−1} θ_i + λ_i^a + η_t µ_i, for i ∈ N(a)
6:     Update (ν_{N(a)}^a, ν_a) ← quad_{η_t}(ω_{N(a)}^a, φ_a)
7:   end for
8:   Update µ_i ← d_i^{−1} Σ_{a: i∈N(a)} (ν_i^a − η_t^{−1} λ_i^a)
9:   Update λ_i^a ← λ_i^a − τ η_t (ν_i^a − µ_i)
10: end for
11: output: µ, ν, λ

3.
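To make the DD-Subgradient loop of Alg. 1 concrete, it can be sketched in a few lines of Python. This is a minimal illustrative sketch, not the authors' implementation: it assumes binary variables, solves each slave (7) by brute-force enumeration in place of a specialized map oracle, represents the replicas ν_i^a as one-hot indicator vectors, and uses the diminishing stepsize η_t = 1/t. The toy chain built at the bottom is an invented example.

```python
import itertools
import numpy as np

def map_slave(omega, phi, vars_, card=2):
    """Brute-force map oracle for one slave (Eq. 7): maximize
    sum_i omega_i[x_i] + phi[x_a] over the factor's local assignments."""
    best, best_x = -np.inf, None
    for xa in itertools.product(range(card), repeat=len(vars_)):
        score = phi[xa] + sum(omega[i][xi] for i, xi in zip(vars_, xa))
        if score > best:
            best, best_x = score, xa
    # return the replicas nu_i^a as one-hot indicator vectors
    return {i: np.eye(card)[xi] for i, xi in zip(vars_, best_x)}

def dd_subgradient(theta, factors, phis, T=20, card=2):
    """Alg. 1 (DD-Subgradient): projected subgradient on the dual (6)."""
    deg = {i: sum(i in f for f in factors) for i in theta}   # d_i = |N(i)|
    lam = {(a, i): np.zeros(card) for a, f in enumerate(factors) for i in f}
    mu = {}
    for t in range(1, T + 1):
        eta = 1.0 / t  # diminishing, nonsummable stepsize
        nu = {}
        for a, f in enumerate(factors):  # slaves: independent, parallelizable
            omega = {i: theta[i] / deg[i] + lam[(a, i)] for i in f}
            nu[a] = map_slave(omega, phis[a], f, card)
        for i in theta:  # averaging = projection onto Lambda (centering)
            mu[i] = sum(nu[a][i] for a, f in enumerate(factors) if i in f) / deg[i]
        for a, f in enumerate(factors):  # subgradient step on the multipliers
            for i in f:
                lam[(a, i)] -= eta * (nu[a][i] - mu[i])
    return mu

# Toy chain 0 -- 1 -- 2 with agreement factors; unaries pull x_0 toward 1 and
# x_2 toward 0, and a small bias on x_1 breaks the tie, so the MAP is (1, 1, 0).
theta = {0: np.array([0.0, 2.0]), 1: np.array([0.0, 0.5]), 2: np.array([2.0, 0.0])}
agree = np.array([[1.0, 0.0], [0.0, 1.0]])
mu = dd_subgradient(theta, factors=[(0, 1), (1, 2)], phis=[agree, agree])
print([int(mu[i].argmax()) for i in range(3)])  # -> [1, 1, 0]
```

On this tree-structured toy the relaxation is tight and the two slaves reach consensus within a couple of iterations. Turning this sketch into Alg. 2 would replace the call to map by the quadratic subproblem quad_{η_t} and add the η_t µ_i term to ω_i^a, which is what drives the faster primal consensus described above.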