
The Thirty-Third AAAI Conference on Artificial Intelligence (AAAI-19)

Block Belief Propagation for Parameter Learning in Markov Random Fields

You Lu (Department of Computer Science, Virginia Tech, Blacksburg, VA, [email protected]); Zhiyuan Liu (Department of Computer Science, University of Colorado Boulder, Boulder, CO, [email protected]); Bert Huang (Department of Computer Science, Virginia Tech, Blacksburg, VA, [email protected])

Abstract

Traditional learning methods for training Markov random fields require doing inference over all variables to compute the likelihood gradient. The iteration complexity for those methods therefore scales with the size of the graphical models. In this paper, we propose block belief propagation learning (BBPL), which uses block-coordinate updates of approximate marginals to compute approximate gradients, removing the need to compute inference on the entire network. Thus, the iteration complexity of BBPL does not scale with the size of the graphs. We prove that the method converges to the same solution as that obtained by using full inference per iteration, despite these approximations, and we empirically demonstrate its scalability improvements over standard training methods.

Introduction

Markov random fields (MRFs) and conditional random fields (CRFs) are powerful classes of models for learning and inference of factored probability distributions (Koller and Friedman 2009; Wainwright and Jordan 2008). They have been widely used in tasks such as structured prediction (Taskar, Guestrin, and Koller 2004) and computer vision (Nowozin and Lampert 2011). Traditional training methods for MRFs learn by maximizing an approximation of the likelihood. Many such methods use variational inference to approximate the crucial partition function.

With MRFs, the gradient of the log likelihood with respect to the model parameters is the marginal vector. With CRFs, it is the expected feature vector. These identities suggest that each iteration of optimization must involve computation of the full marginal vector, containing the estimated marginal probabilities of all variables and all dependent groups of variables. In some applications, the number of variables can be massive, making traditional, full-inference learning too expensive in practice. This problem limits the application of MRFs in modern data science tasks.

In this paper, we propose block belief propagation learning (BBPL), which alleviates the cost of learning by computing approximate gradients with inference over only a small block of variables at a time. BBPL first separates the Markov network into several small blocks. At each iteration of learning, it selects a block and computes its marginals. It approximates the gradient with a mix of the updated and the previous marginals, and it updates the parameters of interest with this gradient.

Related Work

Many methods have been developed to learn MRFs. In this section, we cover only the BP-based methods.

Mean-field variational inference and belief propagation (BP) approximate the partition function with non-convex entropies, which break the convexity of the original partition function. In contrast, convex BP (Globerson and Jaakkola 2007; Heskes 2006; Schwing et al. 2011; Wainwright, Jaakkola, and Willsky 2005; Wainwright 2006) provides a strongly convex upper bound for the partition function. This strong convexity has also been theoretically shown to be beneficial for learning (London, Huang, and Getoor 2015). Thus, our BBPL method uses convex BP to approximate the partition function.

Regarding inference, some methods have been developed to accelerate computation of the beliefs and messages. Stochastic BP (Noorshams and Wainwright 2013) updates only one dimension of the messages at each inference iteration, so its iteration complexity is much lower than that of traditional BP. Distributed BP (Schwing et al. 2011; Yin and Gao 2014) distributes and parallelizes the computation of beliefs and messages on a cluster of machines to reduce inference time. Sparse-matrix BP (Bixler and Huang 2018) uses sparse-matrix products to represent the message-passing indexing, so that it can be implemented on modern hardware. However, to learn MRF parameters, these inference methods must still be run for many iterations on the whole network until convergence at each parameter update. Thus, these methods are still impacted by the network size.

Regarding learning, many learning frameworks have been proposed to efficiently learn MRF parameters. Some approaches use neural networks to directly estimate the messages (Lin et al. 2015; Ross et al. 2011). Methods that truncate message passing (Domke 2011; 2013; Stoyanov, Ropson, and Eisner 2011) redefine the loss function in terms of the approximate inference results obtained after message passing is truncated to a small number of iterations. These methods can be inefficient because they require running inference on the full network for a fixed number of iterations or until convergence. Moreover, they yield non-convex objectives without convergence guarantees.

Some methods restrict the complexity of inference to a subnetwork at each parameter update. Lifted BP (Kersting, Ahmadi, and Natarajan 2009; Singla and Domingos 2008) makes use of the symmetry structure of the network to group nodes into super-nodes, and then it runs a modified BP on this lifted network. A similar approach (Ahmadi, Kersting, and Natarajan 2012) uses a lifted online training framework that combines the advantages of lifted BP and stochastic gradient methods to train Markov logic networks (Richardson and Domingos 2006). These lifting approaches rely on symmetries in relational structure, which may not be present in many applications, to reduce computational costs. They also require extra algorithms to construct the lifted network, which increases the difficulty of implementation. Piecewise training separates a network into several possibly overlapping sets of pieces, and then uses the piecewise pseudo-likelihood as the objective function to train the model (Sutton and McCallum 2009). Decomposed learning (Samdani and Roth 2012) uses a similar approach to train structured support vector machines. However, these methods need to modify the structure of the network by decomposing it, thus changing the objective function.

Finally, inner-dual learning methods (Bach et al. 2015; Hazan and Urtasun 2010; Hazan, Schwing, and Urtasun 2016; Meshi et al. 2010; Taskar et al. 2005) interleave parameter optimization and inference to avoid repeated inferences during training. These methods are fast and convergent in many settings. However, in other settings, key bottlenecks are not fully alleviated, since each partial inference still runs on the whole network, causing the iteration cost to scale with the network size.

Contributions

In contrast to many existing approaches, our proposed method uses the same convex inference objective as traditional learning, providing the same benefits from learning with a strongly convex objective. BBPL thus scales well to large networks because it runs convex BP on a block of variables, fixing the variables in other blocks. This block update guarantees that the iteration complexity does not increase with the network size. BBPL preserves communication across the blocks, allowing it to optimize the original objective that computes inference over the entire network at convergence. The update rules of BBPL are similar in form to those of full BP learning, which makes it as easy to implement as traditional MRF or CRF training. Finally, we theoretically prove that BBPL converges to the same solution under mild conditions defined by the inference assumption. Our experiments empirically show that BBPL does converge to the same optimum as full BP.

Background

In this section, we introduce notation and background knowledge directly related to our work.

Convex Belief Propagation for MRFs

Let x = [x_1, ..., x_n] be a discrete random vector taking values in X = X_1 × · · · × X_n, and let G = (V, E) be the corresponding undirected graph, with vertex set V = {1, ..., n} and edge set E ⊂ V × V. Potential functions θ_s : X_s → R and θ_uv : X_u × X_v → R are differentiable functions with parameters we want to learn. The probability density function of a pairwise Markov random field (Wainwright and Jordan 2008) can be written as

$$p(x \mid \theta) = \exp\Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(u,v) \in E} \theta_{uv}(x_u, x_v) - A(\theta) \Big\}.$$

The log partition function

$$A(\theta) = \log \sum_{x \in \mathcal{X}} \exp\Big\{ \sum_{s \in V} \theta_s(x_s) + \sum_{(u,v) \in E} \theta_{uv}(x_u, x_v) \Big\} \quad (1)$$

is intractable in most situations when G is not a tree. One can approximate the log partition function with a convex upper bound:

$$B(\theta) = \max_{\tau \in \mathbb{L}(G)} \{ \langle \theta, \tau \rangle - B^*(\tau) \}, \quad (2)$$

where

$$\theta = \{\theta_s \mid s \in V\} \cup \{\theta_{uv} \mid (u, v) \in E\},$$
$$\tau = \{\tau_s \mid s \in V\} \cup \{\tau_{uv} \mid (u, v) \in E\},$$
$$\mathbb{L}(G) := \Big\{ \tau \in \mathbb{R}^d_+ \;\Big|\; \sum_{x_s} \tau_s(x_s) = 1, \; \sum_{x_v} \tau_{uv}(x_u, x_v) = \tau_u(x_u) \Big\}.$$

The vector τ is called the pseudo-marginal or belief vector. Specifically, τ_s is the unary belief of vertex s, and τ_uv is the pairwise belief of edge (u, v). The local marginal polytope L restricts the unary beliefs to be consistent with their connected pairwise beliefs. We consider a variant of belief propagation where B*(τ) is strongly convex and has the following form:

$$B^*(\tau) = \sum_{s \in V} \rho_s H(\tau_s) + \sum_{(u,v) \in E} \rho_{uv} H(\tau_{uv}),$$

where ρ_s and ρ_uv are parameters known as counting numbers, and H(·) is the entropy.

Equation 2 can be solved via convex BP (Meshi et al. 2009; Yedidia, Freeman, and Weiss 2005). Let λ_uv be the message from vertex u to vertex v. The update rules for the messages and beliefs are as follows:

$$\lambda_{uv} = \rho_{uv} \log \sum_{x_u} \exp\Big\{ \frac{1}{\rho_{uv}} (\theta_{uv} - \lambda_{vu}) + \log \tau_u \Big\}, \quad (3)$$

where

$$\tau_u \propto \exp\Big\{ \frac{1}{\rho_u} \Big( \theta_u + \sum_{v \in N(u)} \lambda_{vu} \Big) \Big\}, \quad (4)$$

$$\tau_{uv} \propto \exp\Big\{ \frac{1}{\rho_{uv}} (\theta_{uv} - \lambda_{uv} - \lambda_{vu}) + \log \tau_u \tau_v \Big\}. \quad (5)$$
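To make the update rules concrete, the sketch below implements one synchronous sweep of Equations 3–5 with NumPy on a small pairwise MRF. This is a minimal illustration under our own assumed conventions (dictionary-based storage of potentials, messages, and counting numbers, and the function name itself); it is not the authors' implementation.

```python
import numpy as np
from scipy.special import logsumexp

def convex_bp_sweep(edges, theta_n, theta_e, rho_n, rho_e, lam):
    """One synchronous sweep of the convex BP updates (Equations 3-5).

    edges   : list of (u, v) pairs, each undirected edge listed once
    theta_n : dict node -> (k,) array of unary potentials
    theta_e : dict (u, v) -> (k, k) array of pairwise potentials (rows: u, cols: v)
    rho_n   : dict node -> counting number rho_u
    rho_e   : dict (u, v) -> counting number rho_uv
    lam     : dict (u, v) -> (k,) message from u to v (both directions stored)
    """
    nbrs = {s: [] for s in theta_n}
    for u, v in edges:
        nbrs[u].append(v)
        nbrs[v].append(u)

    # Equation 4: unary beliefs from the incoming messages.
    tau_n = {}
    for u in theta_n:
        score = (theta_n[u] + sum(lam[(v, u)] for v in nbrs[u])) / rho_n[u]
        tau_n[u] = np.exp(score - logsumexp(score))  # normalize

    # Equation 3: messages in both directions along each edge.
    new_lam = {}
    for u, v in edges:
        rho = rho_e[(u, v)]
        # lambda_{u->v} is a function of x_v: log-sum-exp over x_u (axis 0).
        new_lam[(u, v)] = rho * logsumexp(
            (theta_e[(u, v)] - lam[(v, u)][:, None]) / rho
            + np.log(tau_n[u])[:, None], axis=0)
        # lambda_{v->u} is a function of x_u: log-sum-exp over x_v (axis 1).
        new_lam[(v, u)] = rho * logsumexp(
            (theta_e[(u, v)] - lam[(u, v)][None, :]) / rho
            + np.log(tau_n[v])[None, :], axis=1)

    # Equation 5: pairwise beliefs from the new messages.
    tau_e = {}
    for u, v in edges:
        rho = rho_e[(u, v)]
        score = ((theta_e[(u, v)]
                  - new_lam[(u, v)][None, :] - new_lam[(v, u)][:, None]) / rho
                 + np.log(np.outer(tau_n[u], tau_n[v])))
        tau_e[(u, v)] = np.exp(score - logsumexp(score))  # normalize
    return tau_n, tau_e, new_lam
```

Iterating this sweep until the messages stop changing yields the beliefs τ that serve as the approximate gradients in the learning procedures described next.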

Other forms of convex BP, such as tree-reweighted BP (Wainwright, Jaakkola, and Willsky 2005), can also be used in our approach. Convex BP does not always converge on loopy networks. However, under some mild conditions, it is guaranteed to converge and can be a good approximation for general networks (Roosta, Wainwright, and Sastry 2008).

Learning Parameters of MRFs

In this subsection, we introduce traditional training methods for fitting MRFs to a dataset via a combination of BP and gradient-based optimization. The learning algorithm is given a dataset with N data points, i.e., w_1, ..., w_N. It then learns θ by minimizing the negative log-likelihood:

$$L(\theta) = -\frac{1}{N}\sum_n \log p(w_n \mid \theta) = -\frac{1}{N}\sum_n \theta^T w_n + A(\theta) \approx -\theta^T \bar{w} + B(\theta) = -\theta^T \bar{w} + \max_{\tau \in \mathbb{L}(G)} \{ \langle \theta, \tau \rangle - B^*(\tau) \}, \quad (6)$$

where $\bar{w} = \frac{1}{N}\sum_n w_n$.

Using B(θ) as the tractable approximation of A(θ), the traditional learning approach is to minimize L(θ) using gradient-based methods. Let θ_t be the parameter vector at iteration t, and let τ*_t be the optimized τ corresponding to θ_t. Then gradient learning is done by iterating

$$\theta_{t+1} = \theta_t - \alpha_t \nabla_\theta L(\theta_t), \quad (7)$$

where

$$\nabla_\theta L(\theta_t) = -\bar{w} + \tau^*_t, \quad (8)$$

and α_t is the learning rate. The traditional parameter learning process is described in Algorithm 1.

Algorithm 1 Parameter learning with full convex BP
1: Initialize θ_0.
2: While θ has not converged
3:   While λ has not converged
4:     Update τ with Equation 4.
5:     Update λ with Equation 3.
6:   end
7:   ∇L(θ_t) = −w̄ + τ*_t
8:   θ_{t+1} = θ_t − α_t ∇L(θ_t)
9: end

Traditional parameter learning requires running inference on each full training example per gradient update, so we refer to it as full BP learning. These full inferences cause it to suffer scalability limitations when the training data includes large MRFs. The goal of our contributions is to circumvent these limitations.

Block Belief Propagation Learning

In this section, we introduce our proposed method, which has a much lower iteration complexity than full BP learning and is guaranteed to converge to the same optimum as full convex BP learning.

Description

The full BP learning method in Algorithm 1 does not scale well to large networks. It needs to perform inference on the full network at each iteration, i.e., Steps 2 to 5 in Algorithm 1. The iteration complexity depends on the size of the network. When the network is large (e.g., a network with at least tens of thousands of nodes), the dimensions of τ and λ are also large. Updating all their dimensions in each gradient step creates a significant computational inefficiency.

To address this problem, we develop block belief propagation learning (BBPL). BBPL only needs to do inference on a subnetwork at each iteration, which means that, for templated models, the iteration complexity of BBPL does not depend on the size of the network. For non-templated models, the iteration complexity has a greatly reduced dependency on the network size. Moreover, we can prove the convergence of BBPL, guaranteeing that it will converge to the same optimum as full BP learning. Finally, BBPL's update rules are analogous to those of full BP learning, so it is as easy to implement as full BP learning.

BBPL first separates the vertex set V into D subsets V_1, ..., V_D. Then it separates the edge set E into D subsets of edges incident on the corresponding vertex subsets, i.e., E_i = {(u, v) | (u, v) ∈ E and (u ∈ V_i or v ∈ V_i)}. Thus, we have that V = V_1 ∪ V_2 ∪ ... ∪ V_D and E = E_1 ∪ E_2 ∪ ... ∪ E_D. For shorthand, let F_i = (E_i, V_i) denote the ith subnetwork.

At iteration t, BBPL selects a subgraph F_i and updates only each τ_u with u ∈ V_i and each τ_uv and λ_uv with (u, v) ∈ E_i. It updates this block of messages and beliefs via belief propagation (i.e., Equations 3, 4, and 5), holding all other messages and beliefs fixed, until convergence. Finally, BBPL uses the updated block of beliefs to update θ by computing an approximate gradient. Our empirical results show that either selecting subgraphs randomly or sequentially can lead to convergence, but sequentially selecting subgraphs can make the algorithm converge faster.

With a slight abuse of notation, let F_t be the subnetwork selected at iteration t. Let τ_t^{(F_t)} and λ_t^{(F_t)} be the submatrices of τ_t and λ_t, respectively, that correspond to F_t. Let U_{F_t} be a projection matrix that projects the parameter from R^{|F_t|} to R^d. The update rules for τ_t and λ_t are

$$\tau_t = \tau_{t-1} + U_{F_t}(\tau_t^{(F_t)} - \tau_{t-1}^{(F_t)}), \quad (9)$$

and

$$\lambda_t = \lambda_{t-1} + U_{F_t}(\lambda_t^{(F_t)} - \lambda_{t-1}^{(F_t)}). \quad (10)$$

After computing τ_t and λ_t, BBPL updates the parameters using the approximate gradient:

$$g(\theta_t) = -\bar{w} + \tau_t. \quad (11)$$

Note that the only difference between g(θ_t) and g(θ_{t−1}) is the term τ_t^{(F_t)} − τ_{t−1}^{(F_t)}. Thus, given g(θ_{t−1}), we can efficiently update our gradient estimate with

$$g(\theta_t) = g(\theta_{t-1}) + U_{F_t}(\tau_t^{(F_t)} - \tau_{t-1}^{(F_t)}).$$

This equation implies that computation of the gradient also does not depend on the size of the network. Changing the gradient at the entries where the block marginal update was performed can be done in place.
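The block update in Equations 9–11 amounts to rerunning inference only on the selected subnetwork and patching the affected entries of the belief and gradient vectors in place. The sketch below shows one BBPL parameter step under assumed flat (concatenated) vector representations of τ and g; the `run_block_inference` callable and the index bookkeeping are stand-ins of our own for whatever block inference code is available, not the authors' implementation.

```python
import numpy as np

def bbpl_step(theta, g, tau, block_index, run_block_inference, step_size):
    """One BBPL parameter update using Equations 9 and 11 on flat vectors.

    theta, g, tau       : (d,) arrays; g is assumed to have been initialized
                          once as -w_bar + tau_0 (Equation 11)
    block_index         : integer indices of the entries of tau belonging to
                          the selected subnetwork F_t
    run_block_inference : callable(theta, tau, block_index) -> (|F_t|,) array
                          of block beliefs after running Equations 3-5 on F_t
                          to convergence, holding all other beliefs fixed
                          (messages lambda are maintained inside it, as in
                          Equation 10)
    """
    block_index = np.asarray(block_index)
    new_block = run_block_inference(theta, tau, block_index)

    # Equation 9: U_{F_t} is an index-selection/embedding matrix, so the
    # update reduces to overwriting the block entries of tau in place.
    delta = new_block - tau[block_index]
    tau[block_index] = new_block

    # Incremental form of Equation 11: only the patched entries of the
    # gradient change, so this step costs O(|F_t|), not O(d).
    g[block_index] += delta

    # The parameter update itself is the only step that touches all d entries.
    theta -= step_size * g
    return theta, g, tau
```

A usage note: before the first call, g must be initialized as −w̄ + τ_0 per Equation 11, and the same block index sets are reused across iterations so that the subnetworks F_1, ..., F_D are fixed throughout training.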

Thus, the complexity of the whole inference process and gradient computation depends only on the size of the subnetwork. The only step that requires time complexity that scales with the network size is the actual parameter update using the gradient, which we will later alleviate when using templated models. The complete algorithm is listed in Algorithm 2.

Algorithm 2 Parameter estimation with block BP
1: Separate G into M subnetworks, i.e., F_1, ..., F_M.
2: While θ has not converged
3:   Select a subgraph F_t.
4:   While λ^{F_t} has not converged
5:     Update τ_t^{F_t} with Equation 4.
6:     Update λ_t^{F_t} with Equation 3.
7:   end
8:   g(θ_t) = g(θ_{t−1}) + U_{F_t}(τ_t^{(F_t)} − τ_{t−1}^{(F_t)})
9:   θ_{t+1} = θ_t − α_t g(θ_t)
10: end

Convergence Analysis

In this subsection, we theoretically prove the convergence of BBPL. We first rewrite the learning problem as follows:

$$\min_\theta \max_{\tau \in \mathbb{L}(G)} -\theta^T \bar{w} + \theta^T \tau - B^*(\tau). \quad (12)$$

Equation 12 is a convex-concave saddle-point problem. We use (θ*, τ*) to represent its saddle point. Its corresponding primal problem is

$$\min_\theta L(\theta) = -\theta^T \bar{w} + B(\theta), \quad (13)$$

where B(θ) is defined in Equation 2. Equation 12 and Equation 13 have the same optimal θ.

Thus, we can prove the convergence of BBPL following the general framework for proving the convergence of saddle-point optimizations (Du and Hu 2018). We prove that under a mild assumption, i.e., Assumption 1, BBPL has a linear convergence rate.

Assumption 1 When θ has not converged, the block coordinate update, i.e., Equation 9, satisfies the following inequality:

$$\|\tau_{t+1} - \tau^*_{t+1}\| \leq (1 - c)\|\tau_t - \tau^*_{t+1}\|,$$

where 0 < c < 1.

Informally, we can interpret Assumption 1 to mean that, at iteration t + 1, the vector τ_{t+1} with the new block τ_{t+1}^{(F_t)} will get closer to the optimum τ*_{t+1}. This assumption is easy to satisfy when θ has not converged. Since the block τ_{t+1}^{(F_t)} of τ_{t+1} is updated with respect to θ_{t+1}, and τ*_{t+1} is the true optimum with respect to θ_{t+1}, τ_{t+1} should be closer to τ*_{t+1}. When θ converges, block BP becomes a block coordinate update method. Based on claims shown by Schwing et al. (2011), τ_t converges to τ*.

The following theorem establishes the linear convergence guarantee of Algorithm 2.

Theorem 1 The primal of BBPL, i.e., Equation 13, is β-strongly convex and η-smooth. When we use BBPL to learn the parameters, suppose BBPL satisfies Assumption 1. Define the Lyapunov function

$$P_t = \|\theta_t - \theta^*\| + \gamma\|\tau_t - \tau^*_t\|.$$

When $0 < \alpha_t \leq \min\big\{\frac{c\beta}{2\eta^2 + \eta\beta + \beta^2}, \frac{2}{\eta + \beta}\big\}$ and $\gamma = \frac{\beta}{2(1-c)\eta^2}$, we have

$$P_{t+1} \leq (1 - \delta)P_t,$$

where 0 < δ < 1. This bound implies that lim_{t→∞} P_t = 0, and θ_t will converge linearly to the optimum θ*.

Proof sketch. The proof is based on the proof framework detailed by Du and Hu (2018). The proof can be divided into three steps. First, we bound the decrease of ||θ_t − θ*||. Second, we bound the decrease of ||τ_t − τ*_t||. Third, we prove that P_{t+1} ≤ (1 − δ)P_t. The complete proof is in the long version of this paper (Lu, Liu, and Huang 2018).

Generalization to Templated or Conditional Models

So far, we have described the BBPL approach in the setting where there is a separate parameter for every entry in the marginal vector. In templated or conditional models, the parameters can be shared across multiple entries. We describe here how BBPL generalizes to such models. For conditional models, we use MRFs to infer the probability of output variables Y given input variables X, i.e., Pr(Y | X). A standard modeling technique for this task is to encode the joint states of the input and output variables with some feature vectors, so the parameters are weights for these features. We are given a dataset of fully observed input-output pairs S = {(M_i, y_i)}_{i=1}^N, where M_i ∈ R^{K×d} is the feature matrix and y_i ∈ R^d is the label vector (i.e., the one-hot encoding of the ground-truth variable states). The negative log-likelihood of a conditional random field is defined as

$$L(\tilde\theta) = -\frac{1}{N}\sum_{i=1}^N \log L_i(\tilde\theta, M_i, y_i), \quad (14)$$

where θ̃ ∈ R^K is the parameter we want to learn, and

$$L_i(\tilde\theta, M_i, y_i) = \tilde\theta^T M_i y_i - B(\tilde\theta^T M_i). \quad (15)$$

The definition of B(θ̃^T M_i) is similar to that for MRFs. We can interpret θ̃^T M_i to be the expanded, or "grounded," potential vector. The product with feature matrix M_i maps from the low-dimensional θ̃ to a possibly high-dimensional, full potential vector θ. To be precise, the definition is

$$B(\tilde\theta^T M_i) = \max_{\tau_i \in \mathbb{L}(G)} \{ \tilde\theta^T M_i \tau_i - B^*(\tau_i) \}.$$

For each data point i, we can still use BBP to update τ_{i,t} at iteration t, i.e., lines 4–9 in Algorithm 2. Let $\bar{w} = \frac{1}{N}\sum_i M_i y_i$. The approximate gradient is then

$$g(\tilde\theta_t) = \bar{w} - \frac{1}{N}\sum_{i=1}^N M_i \tau_{i,t}. \quad (16)$$
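As a small illustration of the grounding step, the sketch below expands the shared parameters θ̃ into a full potential vector through a feature matrix and accumulates the gradient of Equation 16. The variable and function names are ours, and the per-instance beliefs τ_{i,t} are taken as given (in BBPL they would come from running block inference on each instance's network).

```python
import numpy as np

def ground_potentials(theta_tilde, M):
    """Expand shared parameters into a full potential vector: theta = M^T theta_tilde.

    theta_tilde : (K,) shared (templated/conditional) parameters
    M           : (K, d) feature matrix for one training instance
    """
    return M.T @ theta_tilde


def crf_gradient(theta_tilde, feature_mats, labels, taus):
    """Approximate gradient of Equation 16 for a conditional model.

    feature_mats : list of (K, d) matrices M_i
    labels       : list of (d,) one-hot label vectors y_i
    taus         : list of (d,) current belief vectors tau_{i,t}
    """
    N = len(feature_mats)
    w_bar = sum(M @ y for M, y in zip(feature_mats, labels)) / N
    expected = sum(M @ tau for M, tau in zip(feature_mats, taus)) / N
    return w_bar - expected
```

In a templated MRF, repeated columns of M can play the same tying role, reusing one shared potential at every location in the graph where its template is applied.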

Given g(θ̃_{t−1}), we compute g(θ̃_t) with the following rule:

$$g(\tilde\theta_t) = g(\tilde\theta_{t-1}) + \frac{1}{N}\sum_{i=1}^N U_{i,F_t}(\tau_{i,t}^{(F_t)} - \tau_{i,t-1}^{(F_t)}).$$

Thus, the iteration complexity of inference only depends on the subnetwork size, rather than the size of the whole network.

Regarding convergence, note that each matrix M_i is constant, and its norm is bounded by its eigenvalues. Using the same proof method as that of Theorem 1, we can straightforwardly prove that BBPL still has a linear convergence rate. The same formulation, using a matrix M to map from a low-dimensional parameter vector θ̃ to a possibly high-dimensional, full potential vector, can also be used to describe templated MRFs, where the same potential functions may be used in multiple parts of the graph. Therefore the same techniques and analysis also apply for templated MRFs.

Empirical Study

In this section, we empirically analyze the performance of BBPL. We design two groups of experiments. In the first group of experiments, we evaluate the sensitivity of our method to the block size and empirically measure its convergence on synthetic networks. In the second group of experiments, we test our methods on a popular application of MRFs: image segmentation on a real image dataset.

Baselines We compare BBPL to other methods that exactly optimize the full variational likelihood. We use full convex BP and inner-dual learning (Bach et al. 2015; Hazan, Schwing, and Urtasun 2016) as our baselines. Full BP learning runs inference to convergence each gradient step, and inner-dual learning accelerates learning by performing only one iteration of inference per learning iteration.

Metrics We evaluate the convergence and correctness of our method by measuring the objective value and the distance between the current parameter vector θ and the optimal θ*. To compute the objective value, we store the parameters θ_t obtained by full BP learning, BBPL, or inner-dual learning during training. We then plug each into Equation 15 to compute the variational negative log-likelihood. To compute the distance between θ_t and θ*, we first run convex BP learning until convergence to get θ*, and then we use the ℓ2 norm to measure the distance.

Experiments on Synthetic Networks

We generate two types of synthetic MRFs: grid networks and Barabási-Albert (BA) random graph networks (Albert and Barabási 2002). The nodes of the BA networks are ordered according to their generating sequence. We generate true unary and pairwise features from zero-mean, unit-variance Gaussian distributions. The unary feature dimension is 20 and the pairwise feature dimension is 10, and each variable has 8 possible states. Once the true models are generated, we draw 20 samples for each dataset with Gibbs sampling.

Figure 1: Sensitivity of BBPL to block size. The label 5 × 5 in the top plot indicates that each grid is divided into 5 rows and 5 columns of sub-networks. The label 20 in the bottom plot indicates that the network is divided into 20 sub-networks. The results show that BBPL converges the fastest when each block's size is about 20 to 25 times smaller than the network.

Sensitivity to block size We conduct experiments on grid networks of four different sizes, i.e., 50 × 50, 100 × 100, 150 × 150, and 200 × 200, and on BA networks of four different sizes, i.e., 1,000 nodes, 5,000 nodes, 10,000 nodes, and 20,000 nodes. For the grid networks, we vary the number of blocks from 25 (5 × 5) to 100 (10 × 10). For example, when the network size is 200 × 200 and the number of blocks is 5 × 5, we separate the network into 25 sub-networks, and each sub-network's size is 40 × 40. For the BA networks, we vary the number of blocks from 5 to 100. We separate the nodes of each BA network based on their indices (a short code sketch of this partitioning appears at the end of this subsection). For example, for a node set V = {1, 2, ..., 10}, if we want to separate it into two subsets, we will let V_1 = {1, ..., 5} and V_2 = {6, ..., 10}.

The results are plotted in Figure 1. The trends indicate that when each block is between 20 and 25 times smaller than the network, BBPL converges the fastest. When the block size is too small, the algorithm needs more iterations to converge. When the block size is too large, the per-iteration complexity is large. Blocks that are either too large or too small can reduce the benefits of BBPL.

Convergence analysis For the grid networks, we vary the network size from 10 × 10 to 200 × 200, and we set the number of blocks to 4 × 5. For the BA networks, we vary the network size from 100 to 20,000, and we set the number of blocks to 20. Figure 2 and Figure 3 empirically show the convergence of BBPL on different networks. Figure 4 plots the running time comparison on networks of all sizes. From Figure 2 we can see that BBPL converges to the same solution as full BP learning, while requiring significantly less time. Figure 4 shows that on all networks, BBPL is the fastest. The acceleration is more significant when the network is large. For example, when the grid network's size is 200 × 200, BBPL is about three times faster than the inner-dual method and seven times faster than full convex BP. When the BA network size is 20,000, BBPL is about two times faster than inner-dual learning and three times faster than full convex BP.
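For reference, here is a small sketch of the block partitions used above, under assumed conventions of our own: grid nodes are identified by (row, column) and split into a b_r × b_c grid of sub-networks, BA nodes are split into contiguous index ranges, and the edge blocks then collect all edges incident on each vertex block, as defined in the Description section. The helper names are hypothetical, not the authors' code.

```python
def partition_grid(rows, cols, b_r, b_c):
    """Split the nodes of a rows x cols grid into b_r x b_c vertex blocks."""
    blocks = [[] for _ in range(b_r * b_c)]
    for r in range(rows):
        for c in range(cols):
            block = (r * b_r // rows) * b_c + (c * b_c // cols)
            blocks[block].append((r, c))
    return blocks


def partition_by_index(nodes, num_blocks):
    """Split an ordered node list (e.g., BA nodes) into contiguous index blocks."""
    size = -(-len(nodes) // num_blocks)  # ceiling division
    return [nodes[i:i + size] for i in range(0, len(nodes), size)]


def edge_blocks(edges, vertex_blocks):
    """E_i = edges incident on V_i; edges crossing blocks appear in both E_i."""
    membership = {v: i for i, block in enumerate(vertex_blocks) for v in block}
    E = [[] for _ in vertex_blocks]
    for u, v in edges:
        E[membership[u]].append((u, v))
        if membership[v] != membership[u]:
            E[membership[v]].append((u, v))
    return E
```

For instance, `edge_blocks(edges, partition_by_index(list(range(n)), 20))` would produce a 20-block split of an n-node BA network of the kind used in the convergence experiments.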

Figure 2: Convergence on grid networks (50 × 50, 100 × 100, 150 × 150, and 200 × 200; curves for BBPL, full convex BP learning, and inner-dual learning). The top row plots the ℓ2 distance from the optimum, and the second row plots the objective values. The ℓ2 plots show the full running time of each method, but for clarity, we zoom into the plots of objective values by truncating the x-axis.

Figure 3: Results on BA networks (1,000, 5,000, 10,000, and 20,000 nodes; curves for BBPL, full convex BP learning, and inner-dual learning). The top row again plots the ℓ2 distance from the optimum, and the bottom row plots the objective zoomed to show behavior early during optimization.

Figure 4: Comparison of running times on networks of different sizes (top: grid networks from 10 × 10 to 200 × 200; bottom: BA networks from 100 to 20,000 nodes). BBPL converges much faster than the other two methods, especially on large networks, where the improvement in running time is more significant.

Experiments on Image Dataset

For our real data experiments, we use the scene understanding dataset (Gould, Fulton, and Koller 2009) for semantic image segmentation. Each image is 240 × 320 pixels in size. We randomly choose 50 images as the training set and 20 images as the test set. We extract unary features from a fully convolutional network (FCN) (Long, Shelhamer, and Darrell 2015). We add a linear transpose layer between the output layer and the last deconvolution layer of the FCN. This transpose layer does not impact the FCN's performance. Let x ∈ R^n be the output of the deconvolution layer and let y ∈ R^c, where c is the number of classes in the dataset. The input of the transpose layer is x and its output is y. We use x as the MRF's unary features. We use FCN-32 models to generate the features, and we fine-tuned its parameters from the pretrained VGG 16-layer network (Simonyan and Zisserman 2014). Our pairwise features are based on those of Domke (2013): for edge features of vertices s and t, we compute the ℓ2 norm of the unary features between s and t and discretize it into 10 bins.

We train MRFs on this segmentation task. Since these MRFs are large, full BP learning cannot run in a reasonable amount of time. Instead, we first run BBPL until convergence, and then we run inner-dual learning and full BP learning for the same amount of time. Finally, we compare their objective values during optimization. This evaluation scheme was necessary because the baseline methods need several days to converge on these large networks. The results are plotted in Figure 5, and Table 1 shows the number of iterations each algorithm runs before it stops. BBPL's per-iteration complexity is much lower than that of the other methods, so it runs the most iterations in the same time. Following the trends seen in the synthetic experiments, BBPL again reduces the objective much faster than the two other methods. Sample segmentation results are in the long version of this paper (Lu, Liu, and Huang 2018).

Table 1: Number of iterations each algorithm runs.

BBPL | Inner-dual Learning | Full BP Learning
601  | 163                 | 21

Figure 5: The learning objective during training on the scene understanding dataset. We run the three methods for the same amount of time. Our BBPL method is much faster than the two other methods.

Conclusion

In this paper, we developed block belief propagation learning (BBPL) for training Markov random fields. At each learning iteration, our method only performs inference on a subnetwork, and it uses an approximation of the true gradient to optimize the parameters of interest. Thus, BBPL's iteration complexity does not scale with the size of the network. We theoretically prove that BBPL has a linear convergence rate and that it converges to the same optimum as convex BP. Our experiments show that, since BBPL has much lower iteration complexity, it converges faster than other methods that run (truncated or complete) inference on the full MRF each learning iteration.

For future work, we plan to use the scalability of BBPL to analyze large-scale networks. Further speedups may be possible. Even though BBPL only needs to run inference on a subnetwork, it still needs to run many iterations of belief propagation until convergence. We plan to develop a more efficient learning method that can stop inference in a fixed number of iterations, combining the benefits of BBPL and inner-dual learning. Finally, our proof depends on Assumption 1, which has been made in the literature about belief propagation-like algorithms but has not been proven. We aim to both prove its validity for existing methods and use it to help derive new inference methods suitable for learning.

Acknowledgments

We thank the anonymous reviewers and Lijun Chen for their insightful comments.

References

Ahmadi, B.; Kersting, K.; and Natarajan, S. 2012. Lifted online training of relational models with stochastic gradient methods. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases.
Albert, R., and Barabási, A.-L. 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74:47.
Bach, S.; Huang, B.; Boyd-Graber, J.; and Getoor, L. 2015. Paired-dual learning for fast training of latent variable hinge-loss MRFs. In International Conference on Machine Learning.
Bixler, R., and Huang, B. 2018. Sparse matrix belief propagation. In Conference on Uncertainty in Artificial Intelligence.
Domke, J. 2011. Parameter learning with truncated message-passing. In Computer Vision and Pattern Recognition.
Domke, J. 2013. Learning graphical model parameters with approximate marginal inference. IEEE Transactions on Pattern Analysis and Machine Intelligence 35(10):2454–2467.
Du, S. S., and Hu, W. 2018. Linear convergence of the primal-dual gradient method for convex-concave saddle point problems without strong convexity. arXiv preprint arXiv:1802.01504.
Globerson, A., and Jaakkola, T. 2007. Approximate inference using conditional entropy decompositions. In Artificial Intelligence and Statistics.
Gould, S.; Fulton, R.; and Koller, D. 2009. Decomposing a scene into geometric and semantically consistent regions. In International Conference on Computer Vision.
Hazan, T., and Urtasun, R. 2010. A primal-dual message-passing algorithm for approximated large scale structured prediction. In Advances in Neural Information Processing Systems.
Hazan, T.; Schwing, A. G.; and Urtasun, R. 2016. Blending learning and inference in conditional random fields. The Journal of Machine Learning Research 17:8305–8329.
Heskes, T. 2006. Convexity arguments for efficient minimization of the Bethe and Kikuchi free energies. Journal of Artificial Intelligence Research 26:153–190.
Kersting, K.; Ahmadi, B.; and Natarajan, S. 2009. Counting belief propagation. In Uncertainty in Artificial Intelligence.
Koller, D., and Friedman, N. 2009. Probabilistic Graphical Models: Principles and Techniques. MIT Press.
Lin, G.; Shen, C.; Reid, I.; and van den Hengel, A. 2015. Deeply learning the messages in message passing inference. In Advances in Neural Information Processing Systems.
London, B.; Huang, B.; and Getoor, L. 2015. The benefits of learning with strongly convex approximate inference. In International Conference on Machine Learning.
Long, J.; Shelhamer, E.; and Darrell, T. 2015. Fully convolutional networks for semantic segmentation. In Computer Vision and Pattern Recognition, 3431–3440.
Lu, Y.; Liu, Z.; and Huang, B. 2018. Block belief propagation for parameter learning in Markov random fields. arXiv preprint arXiv:1811.04064.
Meshi, O.; Jaimovich, A.; Globerson, A.; and Friedman, N. 2009. Convexifying the Bethe free energy. In Conference on Uncertainty in Artificial Intelligence.
Meshi, O.; Sontag, D.; Jaakkola, T.; and Globerson, A. 2010. Learning efficiently with approximate inference via dual losses. In International Conference on Machine Learning.
Noorshams, N., and Wainwright, M. J. 2013. Stochastic belief propagation: A low-complexity alternative to the sum-product algorithm. IEEE Transactions on Information Theory 59:1981–2000.
Nowozin, S., and Lampert, C. H. 2011. Structured learning and prediction in computer vision. Foundations and Trends in Computer Graphics and Vision 6:185–365.
Richardson, M., and Domingos, P. 2006. Markov logic networks. Machine Learning 62:107–136.
Roosta, T. G.; Wainwright, M. J.; and Sastry, S. S. 2008. Convergence analysis of reweighted sum-product algorithms. IEEE Transactions on Signal Processing 56:4293–4305.
Ross, S.; Munoz, D.; Hebert, M.; and Bagnell, J. A. 2011. Learning message-passing inference machines for structured prediction. In Computer Vision and Pattern Recognition.
Samdani, R., and Roth, D. 2012. Efficient decomposed learning for structured prediction. arXiv preprint arXiv:1206.4630.
Schwing, A.; Hazan, T.; Pollefeys, M.; and Urtasun, R. 2011. Distributed message passing for large scale graphical models. In Computer Vision and Pattern Recognition.
Simonyan, K., and Zisserman, A. 2014. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.
Singla, P., and Domingos, P. M. 2008. Lifted first-order belief propagation. In Association for the Advancement of Artificial Intelligence.
Stoyanov, V.; Ropson, A.; and Eisner, J. 2011. Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure. In International Conference on Artificial Intelligence and Statistics, 725–733.
Sutton, C., and McCallum, A. 2009. Piecewise training for structured prediction. Machine Learning 77(2–3):165–194.
Taskar, B.; Chatalbashev, V.; Koller, D.; and Guestrin, C. 2005. Learning structured prediction models: A large margin approach. In International Conference on Machine Learning.
Taskar, B.; Guestrin, C.; and Koller, D. 2004. Max-margin Markov networks. In Advances in Neural Information Processing Systems.
Wainwright, M. J., and Jordan, M. I. 2008. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning 1:1–305.
Wainwright, M. J.; Jaakkola, T. S.; and Willsky, A. S. 2005. A new class of upper bounds on the log partition function. IEEE Transactions on Information Theory 51:2313–2335.
Wainwright, M. J. 2006. Estimating the "wrong" graphical model: Benefits in the computation-limited setting. Journal of Machine Learning Research 7:1829–1859.
Yedidia, J. S.; Freeman, W. T.; and Weiss, Y. 2005. Constructing free-energy approximations and generalized belief propagation algorithms. IEEE Transactions on Information Theory 51:2282–2312.
Yin, J., and Gao, L. 2014. Scalable distributed belief propagation with prioritized block updates. In International Conference on Information and Knowledge Management.
