Tensor Belief Propagation

Andrew Wrigley¹   Wee Sun Lee²   Nan Ye³

¹Australian National University, Canberra, Australia. ²National University of Singapore, Singapore. ³Queensland University of Technology, Brisbane, Australia. Correspondence to: Andrew Wrigley, Wee Sun Lee, Nan Ye.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We propose a new approximate inference algorithm for graphical models, tensor belief propagation, based on approximating the messages passed in the junction tree algorithm. Our algorithm represents the potential functions of the graphical model and all messages on the junction tree compactly as mixtures of rank-1 tensors. Using this representation, we show how to perform the operations required for inference on the junction tree efficiently: marginalisation can be computed quickly due to the factored form of rank-1 tensors, while multiplication can be approximated using sampling. Our analysis gives sufficient conditions for the algorithm to perform well, including for the case of high-treewidth graphs, for which exact inference is intractable. We compare our algorithm experimentally with several approximate inference algorithms and show that it performs well.

1. Introduction

Probabilistic graphical models provide a general framework to conveniently build probabilistic models in a modular and compact way. They are commonly used in statistics, computer vision, natural language processing, machine learning and many related fields (Wainwright & Jordan, 2008). The success of graphical models depends largely on the availability of efficient inference algorithms. Unfortunately, exact inference is intractable in general, making approximate inference an important research topic.

Approximate inference algorithms generally adopt a variational approach or a sampling approach. The variational approach formulates the inference problem as an optimisation problem and constructs approximations by solving relaxations of the optimisation problem. A number of well-known inference algorithms can be seen as variational algorithms, such as loopy belief propagation, mean-field variational inference, and generalized belief propagation (Wainwright & Jordan, 2008). The sampling approach uses sampling to approximate either the underlying distribution or key quantities of interest. Commonly used sampling methods include particle filters and Markov-chain Monte Carlo (MCMC) algorithms (Andrieu et al., 2003).

Our proposed algorithm, tensor belief propagation (TBP), can be seen as a sampling-based algorithm. Unlike particle filters or MCMC methods, which sample states (also known as particles), TBP samples functions in the form of rank-1 tensors. Specifically, we use a data structure commonly used in exact inference, the junction tree, and perform approximate message passing on the junction tree using messages represented as mixtures of rank-1 tensors. We assume that each factor in the graphical model is originally represented as a tensor decomposition (mixture of rank-1 tensors). Under this assumption, all messages and intermediate factors also have the same representation. Our key observation is that marginalisation can be performed efficiently for mixtures of rank-1 tensors, and multiplication can be approximated by sampling. This leads to an approximate message passing algorithm where messages and intermediate factors are approximated by low-rank tensors.

We provide analysis, giving conditions under which the method performs well. We compare TBP experimentally with several existing approximate inference methods using Ising models, random MRFs and two real-world datasets, demonstrating promising results.
2. Related Work

Exact inference on tree-structured graphical models can be performed efficiently using belief propagation (Pearl, 1982), a dynamic programming algorithm that involves passing messages between nodes containing the results of intermediate computations. For arbitrary graphical models, the well-known junction tree algorithm (Shafer & Shenoy, 1990; Lauritzen & Spiegelhalter, 1988; Jensen et al., 1990) is commonly used. The model is first compiled into a junction tree data structure and a similar message passing algorithm is then run over the junction tree. Unfortunately, for non-tree models the time and space complexity of the junction tree algorithm grows exponentially with a property of the graph called its treewidth. For high-treewidth graphical models, exact inference is intractable in general.

Our work approximates the messages passed in the junction tree algorithm to avoid the exponential runtime caused by exact computations in high-treewidth models. Various previous work has taken the same approach. Expectation propagation (EP) (Minka, 2001) approximates messages by minimizing the Kullback-Leibler (KL) divergence between the actual message and its approximation. Structured message passing (Gogate & Domingos, 2013) can be considered as a special case of EP where structured representations, in particular algebraic decision diagrams (ADDs) and sparse hash tables, are used so that EP can be performed efficiently. In contrast to ADDs, the tensor decompositions used for TBP may provide a more compact representation for some problems. An ADD partitions a tensor into axis-aligned hyper-rectangles; it is possible to represent a hyper-rectangle using a rank-1 tensor, but a rank-1 tensor is generally not representable as an axis-aligned rectangle. Furthermore, the supports of the rank-1 tensors in the mixture may overlap. However, an ADD compresses hyper-rectangles that share sub-structures, and this may result in the two methods having different strengths. Sparse tables, on the other hand, work well for cases with extreme sparsity.
Several methods use sampled particles to approximate messages (Koller et al., 1999; Ihler & McAllester, 2009; Sudderth et al., 2010). To allow their algorithms to work well on problems with less sparsity, Koller et al. (1999) and Sudderth et al. (2010) use non-parametric methods to smooth the particle representation of messages. In contrast, we decompose each tensor into a mixture of rank-1 tensors and sample the rank-1 tensors directly, instead of through the intermediate step of sampling particles. Another approach, which pre-samples the particles at each node and passes messages between these pre-sampled particles, was taken by Ihler & McAllester (2009). The methods of Ihler & McAllester (2009) and Sudderth et al. (2010) were also applied on graphs with loops using loopy belief propagation.

Xue et al. (2016) use discrete Fourier representations for inference via the elimination algorithm. The discrete Fourier representation is a special type of tensor decomposition. Instead of sampling, the authors perform approximations by truncating the Fourier coefficients, giving different approximation properties. Other related works include (Darwiche, 2000; Park & Darwiche, 2002; Chavira & Darwiche, 2005), where belief networks are compiled into compact arithmetic circuits (ACs). On the related problem of MAP inference, McAuley & Caetano (2011) show that junction tree clusters that factor over subcliques or consist only of latent variables yield improved complexity properties.

3. Preliminaries

For simplicity we limit our discussion to Markov random fields (MRFs), but our results apply equally to Bayesian networks and general factor graphs. We focus only on discrete models. A Markov random field G is an undirected graph representing a probability distribution P(X_1, ..., X_N), such that P factorises over the max-cliques in G, i.e.

    P(X_1, \ldots, X_N) = \frac{1}{Z} \prod_{c \in cl(G)} \phi_c(X_c),    (1)

where cl(G) is the set of max-cliques in G and Z = \sum_X \prod_{c \in cl(G)} \phi_c(X_c) ensures normalisation. We call the factors φ_c clique potentials or potentials.

TBP is based on the junction tree algorithm (see e.g. Koller & Friedman (2009)). A junction tree is a special type of cluster graph, i.e. an undirected graph with nodes called clusters that are associated with sets of variables rather than single variables. Specifically, a junction tree is a cluster graph that is a tree and which also satisfies the running intersection property. The running intersection property states that if a variable is in two clusters, it must also be in every cluster on the path that connects the two clusters.

The junction tree algorithm is essentially the well-known belief propagation algorithm applied to the junction tree after the cluster potentials have been initialized. At initialisation, each clique potential is first associated with a cluster. Each cluster potential Φ_t(X_t) is computed by multiplying all the clique potentials φ_c(X_c) associated with the cluster X_t. Thereafter, the algorithm is defined recursively in terms of messages passed between neighbouring clusters. A message is always a function of the variables in the receiving cluster, and represents an intermediate marginalisation over a partial set of factors. The message m_{t→s}(X_s) sent from a cluster t to a neighbouring cluster s is defined recursively by

    m_{t \to s}(X_s) = \sum_{X_t \setminus X_s} \Phi_t(X_t) \prod_{u \in N(t) \setminus \{s\}} m_{u \to t}(X_t),    (2)

where N(t) is the set of neighbours of t. Since the junction tree is singly connected, this recursion is well-defined. After all messages have been computed, the marginal distribution on a cluster of variables X_s is computed using

    P_s(X_s) \propto \Phi_s(X_s) \prod_{t \in N(s)} m_{t \to s}(X_s),    (3)

and univariate marginals can be computed by summation over cluster marginals. The space and time complexity of the junction tree inference algorithm is exponential in the induced width of the graph, i.e. the number of variables in the largest tree cluster minus 1 (Koller & Friedman, 2009). The lowest possible induced width (over all possible junction trees for the graph) is defined as the treewidth of the graph.¹

¹ Finding the treewidth of a graph is NP-hard; in practice we use various heuristics to construct the junction tree.
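To ground the updates (2) and (3), the following is a minimal sketch of one exact message computation for clusters stored as dense tables. It is ours for illustration only (the paper's implementation is in C++ inside libDAI); the data layout and helper names are assumptions.

import numpy as np

def send_message(cluster_vars_t, phi_t, incoming, sep_vars):
    # Compute m_{t->s}(X_s) as in Eq. (2) for dense table potentials.
    #   cluster_vars_t: tuple of variable ids in cluster t (one tensor axis each)
    #   phi_t:          dense cluster potential Phi_t over cluster_vars_t
    #   incoming:       list of (vars_u, table) messages from neighbours u != s,
    #                   each defined over a subset of cluster_vars_t
    #   sep_vars:       variables shared with the receiving cluster s
    prod = phi_t.copy()
    for vars_u, m in incoming:
        # align the message's axes with the cluster's axis order, then broadcast
        order = sorted(range(len(vars_u)),
                       key=lambda i: cluster_vars_t.index(vars_u[i]))
        shape = [phi_t.shape[cluster_vars_t.index(v)] if v in vars_u else 1
                 for v in cluster_vars_t]
        prod = prod * np.transpose(m, order).reshape(shape)
    # marginalise out the variables not shared with the receiving cluster
    sum_axes = tuple(i for i, v in enumerate(cluster_vars_t) if v not in sep_vars)
    return prod.sum(axis=sum_axes)

# toy usage: cluster t over variables (0, 1, 2) sends to a neighbour over (1, 2)
phi_t = np.random.rand(2, 2, 2)
incoming = [((0,), np.random.rand(2))]   # one earlier message, over variable 0
m_ts = send_message((0, 1, 2), phi_t, incoming, sep_vars=(1, 2))
print(m_ts.shape)                        # (2, 2)

This dense computation is exactly what becomes intractable for large clusters; TBP, described next, replaces the tables by mixtures of rank-1 tensors.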

4. Tensor Belief Propagation

The TBP algorithm we propose (Algorithm 1) is the same as the junction tree algorithm except for the approximations at line 1 and line 4, which we describe below.

Algorithm 1 Tensor Belief Propagation
  input: Clique potentials {φ_c(X_c)}, junction tree J.
  output: Approximate cluster marginals {P̃_s(X_s)} for all s.
  1: For each cluster X_t, compute Φ̃_t(X_t) ≈ ∏_c φ_c(X_c), where the product is over all cliques c associated with t.
  2: while there is any unsent message do
  3:   Pick an unsent message m_{t→s}(X_s) with all messages to t from neighbours other than s sent.
  4:   Send m̃_{t→s}(X_s) ≈ Σ_{X_t∖X_s} Φ̃_t(X_t) ∏_{u∈N(t)∖{s}} m̃_{u→t}(X_t).
  5: end while
  6: return P̃_s(X_s) ∝ Φ̃_s(X_s) ∏_{t∈N(s)} m̃_{t→s}(X_s).

There are two challenges in applying the junction tree algorithm to high-treewidth graphical models: representing the intermediate potential functions, and computing them. For representation, using tables to represent the cluster potentials and messages requires space exponential in the induced width. For computation, the two operations relevant to the complexity of the algorithm are marginalisation over a subset of variables in a cluster and multiplication of multiple factors. When clusters become large, these operations become intractable unless additional structure can be exploited.

TBP alleviates these difficulties by representing all potential functions as mixtures of rank-1 tensors. We show how to perform exact marginalisation of a mixture (required in line 4) efficiently, and how to perform approximate multiplication of mixtures using sampling (used for approximation in lines 1 and 4).

4.1. Mixture of Rank-1 Tensors

As we are concerned with discrete distributions, each potential φ_c can be represented by a multidimensional array, i.e. a tensor. Furthermore, a d-dimensional tensor T ∈ R^{N_1×···×N_d} can be decomposed into a weighted sum of outer products of vectors as

    T = \sum_{k=1}^{r} w_k\, a_k^1 \otimes a_k^2 \otimes \cdots \otimes a_k^d,    (4)

where w_k ∈ R, a_k^i ∈ R^{N_i} and ⊗ is the outer product, i.e. (a_k^1 ⊗ a_k^2 ⊗ ··· ⊗ a_k^d)_{i_1,...,i_d} = a^1_{k,i_1} ··· a^d_{k,i_d}. We denote the vector of weights {w_k} as w. The smallest r for which an exact r-term decomposition exists is called the rank of T, and a decomposition (4) using this r is a tensor rank decomposition. This decomposition is known by several names including CANDECOMP/PARAFAC (CP) decomposition and Hitchcock decomposition (Kolda & Bader, 2009). In this paper, we assume without loss of generality that the weights are non-negative and sum to 1, giving a mixture of rank-1 tensors. Such a mixture forms a probability distribution over rank-1 tensors, which we refer to as a tensor belief. We also assume that the decomposition is non-negative, i.e. the rank-1 tensors are non-negative, although the method can be extended to allow negative values.

For a (clique or cluster) potential function φ_s(X_s) over |X_s| = N_s variables, (4) is equivalent to decomposing φ_s into a sum of fully-factorised terms, i.e.

    \phi_s(X_s) = \sum_{k=1}^{r} w_k^s\, \psi_k^s(X_s) = \sum_{k=1}^{r} w_k^s\, \psi_{k,1}^s(X_{s_1}) \cdots \psi_{k,N_s}^s(X_{s_{N_s}}).    (5)

This observation allows us to perform marginalisation and multiplication operations efficiently.
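As a concrete illustration of the representation in (5), the sketch below stores a potential as a weight vector plus, per mixture component, one non-negative vector per variable. Expanding back to a dense table is only for sanity checks on small examples; the container and field names are ours, not the paper's implementation.

import numpy as np
from dataclasses import dataclass

@dataclass
class Rank1Mixture:
    # phi(x) = sum_k weights[k] * prod_j factors[k][j][x_j]   (cf. Eq. 5)
    weights: np.ndarray   # shape (r,), non-negative, summing to 1
    factors: list         # factors[k][j]: 1-D non-negative array for variable j

    def to_dense(self):
        # Expand to a full table; exponential in the number of variables,
        # so only usable for tiny examples.
        dense = 0.0
        for w, fs in zip(self.weights, self.factors):
            term = w
            for f in fs:
                term = np.multiply.outer(term, f)
            dense = dense + term
        return dense

# a rank-2 mixture over two binary variables
phi = Rank1Mixture(
    weights=np.array([0.6, 0.4]),
    factors=[[np.array([1.0, 2.0]), np.array([0.5, 1.5])],
             [np.array([2.0, 1.0]), np.array([1.0, 1.0])]],
)
print(phi.to_dense())   # a 2 x 2 table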

4.2. Marginalisation

Marginalising out a variable X_i from a cluster X_s simplifies in the same manner as if the distribution was fully-factorised, namely

    \sum_{X_i} \phi_s(X_s) = \sum_{k=1}^{r} w_k^s \Big( \sum_{X_i} \psi_{k,j}^s(X_i) \Big) \cdot \psi_{k,1}^s(X_{s_1})\, \psi_{k,2}^s(X_{s_2}) \cdots \psi_{k,N_s}^s(X_{s_{N_s}}) \quad \text{(excluding } \psi_{k,j}^s(X_i)\text{)},    (6)

where we simply push the sum Σ_{X_i} inside and only evaluate it over the univariate factor ψ_{k,j}^s(X_i). We can then absorb this sum into the weights {w_k^s} and the result stays in decomposed form (5). To marginalise over multiple variables, we evaluate (6) for each variable in turn.
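Continuing the running sketch above (and assuming its Rank1Mixture container), marginalising a variable out of a mixture as in (6) only sums that variable's univariate factor in each component and folds the result into the component's weight. Returning the absorbed mass separately so that the new weights still sum to 1 is our own bookkeeping choice.

import numpy as np

def marginalise_out(mix, j):
    # Sum variable j out of a Rank1Mixture (Eq. 6). Returns a new mixture over
    # the remaining variables (in their original relative order) plus the
    # total mass absorbed into the weights.
    new_w, new_factors = [], []
    for w, fs in zip(mix.weights, mix.factors):
        s = fs[j].sum()                               # sum the single univariate factor
        new_w.append(w * s)                           # absorb it into the weight
        new_factors.append([f for i, f in enumerate(fs) if i != j])
    new_w = np.asarray(new_w)
    scale = new_w.sum()
    return Rank1Mixture(weights=new_w / scale, factors=new_factors), scale

marg, scale = marginalise_out(phi, j=1)
print(scale * marg.to_dense())     # equals phi.to_dense().sum(axis=1)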

4.3. Multiplication

The key observation for multiplication is that a mixture of rank-1 tensors can be treated as a probability distribution over the rank-1 tensors with expectation equal to the true function, by considering the weight of each rank-1 term w_k as its probability.

To multiply two potentials φ_i and φ_j, we repeatedly sample rank-1 terms from each multiplicand to build the product. We draw a sample of K pairs of indices {(k_r, l_r)}_{r=1}^K independently from w^i and w^j respectively, and use the approximation

    \phi_i(X_i) \cdot \phi_j(X_j) \approx \frac{1}{K} \sum_{r=1}^{K} \psi_{k_r}^i(X_i) \cdot \psi_{l_r}^j(X_j).    (7)

The approximation is also a mixture of rank-1 tensors, with the rank-1 tensors being the ψ_{k_r}^i(X_i) · ψ_{l_r}^j(X_j), and their weights being the frequencies of the sampled (k_r, l_r) pairs. This process is equivalent to drawing a sample of the same size from the distribution representing φ_i(X_i) · φ_j(X_j) and hence provides an unbiased estimate of the product function. Multiplication of each pair ψ_{k_r}^i(X_i) · ψ_{l_r}^j(X_j) can be performed efficiently due to their factored forms. It is possible to extend the method to allow multiplication of more potential functions simultaneously, but we only use pairwise multiplication in this paper, repeating the process as necessary to multiply more functions.
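The sampled product (7) can be sketched as follows. Here each rank-1 component is stored as a dict from variable id to its univariate factor (a different bookkeeping from the sketch above, since the two multiplicands may be over different variable sets); shared variables multiply their factors elementwise, and the output weights are the pair frequencies, as in the text. This is an illustration of the idea, not the paper's code.

import numpy as np

def sampled_product(w_a, comps_a, w_b, comps_b, K, seed=0):
    # Approximate phi_a * phi_b by K sampled pairs of rank-1 terms (Eq. 7).
    # comps_a[k], comps_b[l]: dicts {variable id: 1-D non-negative factor}.
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(w_a), size=K, p=w_a)
    ls = rng.choice(len(w_b), size=K, p=w_b)
    counts, comps = {}, {}
    for k, l in zip(ks, ls):
        if (k, l) not in counts:
            prod = dict(comps_a[k])                        # copy the first term's factors
            for v, f in comps_b[l].items():
                prod[v] = prod[v] * f if v in prod else f  # shared variable: multiply
            comps[(k, l)] = prod
            counts[(k, l)] = 0
        counts[(k, l)] += 1
    pairs = list(comps)
    weights = np.array([counts[p] for p in pairs], dtype=float) / K
    return weights, [comps[p] for p in pairs]

# toy usage: two rank-2 potentials over variables {0, 1} and {1, 2}
w_a, w_b = np.array([0.7, 0.3]), np.array([0.5, 0.5])
comps_a = [{0: np.array([1.0, 2.0]), 1: np.array([1.0, 1.0])},
           {0: np.array([2.0, 1.0]), 1: np.array([0.5, 1.0])}]
comps_b = [{1: np.array([1.0, 3.0]), 2: np.array([1.0, 2.0])},
           {1: np.array([2.0, 1.0]), 2: np.array([1.0, 1.0])}]
weights, comps = sampled_product(w_a, comps_a, w_b, comps_b, K=1000)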

4.4. Theoretical Analysis

For simplicity, we give results for binary MRFs. Extension to non-binary variables is straightforward. We assume that exact tensor decompositions are given for all initial clique potentials.

Theorem 1. Consider a binary MRF with C max-cliques with potential functions represented as non-negative tensor decompositions. Consider the unnormalised distribution D(X) = ∏_{c∈cl(G)} φ_c(X_c). Let D_i(X_i) = Σ_{X∖X_i} D(X) be the unnormalised marginal for variable X_i. Assume that the values of the rank-1 tensors in any mixture resulting from pairwise multiplication are upper bounded by M, and that the approximation target for any cell in any multiplication operation is lower bounded by B. Let D̃_i(X_i) be the estimates produced by TBP using a junction tree with induced width T. With probability at least 1 − δ, for all i and x_i,

    (1 - \epsilon) D_i(x_i) \le \tilde{D}_i(x_i) \le (1 + \epsilon) D_i(x_i)

if the sample size used for all multiplication operations is at least

    K_{\min}(\epsilon, \delta) \in O\!\left( \frac{C^2}{\epsilon^2} \frac{M}{B} \left( \log C + T + \log \frac{2}{\delta} \right) \right).

Furthermore, D̃_i(X_i) remains a consistent estimator for D_i(X_i) if B = 0.
Proof. We give the proof for the case B ≠ 0 here. The consistency proof for the case B = 0 can be found in the supplementary material.

The errors in the unnormalised marginals are due to the errors in approximate pairwise multiplication defined by Equation (7). We first give bounds for the errors in all the pairwise multiplication by sampling operations, then derive the bounds for the errors in the unnormalised marginals.

Recall that by the multiplicative Chernoff bound (see e.g. (Koller & Friedman, 2009)), for K i.i.d. random variables Y_1, ..., Y_K in range [0, M] with expected value µ, we have

    P\!\left( \frac{1}{K} \sum_{i=1}^{K} Y_i \notin [(1-\zeta)\mu, (1+\zeta)\mu] \right) \le 2 \exp\!\left( -\frac{\zeta^2 \mu K}{3M} \right).

The number of pairwise multiplications required to initialize the cluster potentials {Φ̃_t(X_t)} is at most C (line 1). The junction tree has at most C clusters, and thus at most 2(C − 1) messages need to be computed. Each message requires at most C pairwise multiplications (line 4). Hence the total number of pairwise multiplications needed is at most 2C². Each pairwise multiplication involves functions defined on at most T binary variables, and thus it simultaneously estimates at most 2^T values. Hence at most 2^{T+1} C² values are estimated by sampling in the algorithm.

Since each estimate satisfies the Chernoff bound, we can apply a union bound to bound the probability that the estimate for any µ_j is outside [(1 − ζ)µ_j, (1 + ζ)µ_j], giving

    P\!\left( \frac{1}{K} \sum_{i=1}^{K} X_i \notin [(1-\zeta)\mu_j, (1+\zeta)\mu_j] \text{ for any estimate } j \right) \le 2^{T+2} C^2 \exp\!\left( -\frac{\zeta^2 B K}{3M} \right),

where B is a lower bound on the minimum value of all cells estimated during the algorithm. If we set an upper bound of δ on this error probability and rearrange for K, we have that with probability at least 1 − δ, all estimates are within a factor of (1 ± ζ) of their true values when

    K \ge \frac{3}{\zeta^2} \frac{M}{B} \ln \frac{2^{T+2} C^2}{\delta}.    (8)

We now seek an expression for the sample size required for the unnormalised marginals to have small errors. First we argue that at most C + Q − 1 pairwise multiplications are used in the process of constructing the marginals at any node u, where Q is the number of clusters. This is true for the base case of a tree of size 1, as no messages need to be passed. As the inductive hypothesis, assume that the statement is true for trees of size less than n. The node u is considered as the root of the tree, and by the inductive hypothesis the number of multiplications used in each subtree i is at most C_i + Q_i − 1, where C_i and Q_i are the number of clique potentials and clusters associated with subtree i. Summing them up and noting that we need to perform at most one additional multiplication for each clique potential associated with u for initialisation, and one additional multiplication for each subtree, gives us the required result. To simplify analysis, we bound C + Q − 1 by 2C from here onwards.

At worst, each pairwise multiplication results in an extra (1 ± ζ) factor in the bound. Since we are using a multiplicative bound, marginalisation operations have no effect on the bound. As we do no more than 2C multiplications, the final marginal estimates are all within a factor (1 ± ζ)^{2C} of their true value.

To bound the marginal so that it is within a factor (1 ± ε) of its true value for a chosen ε > 0, we note that choosing ζ = ln(1+ε)/(2C) implies (1 − ζ)^{2C} ≥ 1 − ε and (1 + ζ)^{2C} ≤ (1 + ε):

    (1 - \zeta)^{2C} = \left(1 - \frac{\ln(1+\epsilon)}{2C}\right)^{2C} \ge 1 - \ln(1+\epsilon) \ge 1 - \epsilon,    (9)

    (1 + \zeta)^{2C} = \left(1 + \frac{\ln(1+\epsilon)}{2C}\right)^{2C} \le \exp(\ln(1+\epsilon)) = 1 + \epsilon.    (10)

In (9) we use Bernoulli's inequality together with ln(1+ε) ≤ ε, and in (10) we use (1+x)^r ≤ exp(rx). Substituting this ζ into (8), we have that all marginal estimates are accurate within a factor (1 ± ε) with probability at least 1 − δ, when

    K \ge \frac{12 C^2}{(\ln(1+\epsilon))^2} \frac{M}{B} \ln \frac{2^{T+2} C^2}{\delta}.    (11)

Using the fact that ln(1+ε) ≥ ε · ln 2 for 0 ≤ ε ≤ 1, and 24/ln 2 < 35, we can relax this bound to

    K \ge \frac{35 C^2}{\epsilon^2} \frac{M}{B} \left( \log C + T + \log \frac{2}{\delta} \right).

Corollary 1. Under the same conditions as Theorem 1, with probability at least 1 − δ, the normalised marginal estimates p̃_i(x_i) satisfy

    (1 - \gamma)\, p_i(x_i) \le \tilde{p}_i(x_i) \le (1 + \gamma)\, p_i(x_i)

for all i and x_i, if the sample size used for all multiplication operations is at least

    K_{\min}(\gamma, \delta) \in O\!\left( \frac{C^2}{\gamma^2} \frac{M}{B} \left( \log C + T + \log \frac{1}{\delta} \right) \right).

Proof. Suppose the unnormalised marginals have relative error bounded by (1 ± ε), i.e.

    (1 - \epsilon) D_i(x_i) \le \tilde{D}_i(x_i) \le (1 + \epsilon) D_i(x_i)

for all i and x_i. Then we have

    \tilde{p}_i(x_i) = \frac{\tilde{D}_i(x_i)}{\sum_{x_i} \tilde{D}_i(x_i)} \le \frac{(1+\epsilon) D_i(x_i)}{\sum_{x_i} (1-\epsilon) D_i(x_i)} = \frac{1+\epsilon}{1-\epsilon}\, p_i(x_i).

To bound the relative error on p̃_i(x_i) to (1 + γ), we set

    \frac{1+\epsilon}{1-\epsilon} = 1 + \gamma \;\Longrightarrow\; \epsilon = \frac{\gamma}{\gamma + 2}.

Since 1/ε² = ((γ+2)/γ)² < (3/γ)² = O(1/γ²) for γ < 1, the increase in K_min required to bound the normalised estimates rather than the unnormalised estimates is at most a constant factor. Thus,

    K_{\min}(\gamma, \delta) \in O\!\left( \frac{C^2}{\gamma^2} \frac{M}{B} \left( \log C + T + \log \frac{1}{\delta} \right) \right)

as required. The negative side of the bound is similar.
Interestingly, the sample size does not grow exponentially with the induced width, and hence the treewidth of the graph. As the inference problem is NP-hard, we expect the ratio M/B to be large in difficult problems. The M/B ratio comes from bounding the relative error when sampling a mixture φ(x) = Σ_{k=1}^r w_k ψ_k(x). A more refined bound can be obtained by bounding max_k max_x ψ_k(x)/φ(x) instead; this bound would not grow as quickly if ψ_k(x) is always small whenever φ(x) is small. Understanding the properties of function classes where these bounds are small may help us understand when TBP works well.

Theorem 1 suggests that it may be useful to reweight the rank-1 tensors in a multiplicand to give a smaller M/B ratio. The following proposition gives a reweighting scheme that minimises the maximum value of the rank-1 tensors in a multiplicand, which leads to a smaller M with B fixed. Theorem 1 still holds with this reweighting.

Proposition 1. Let φ(x) = Σ_{k=1}^r w_k ψ_k(x) where w_k ≥ 0, Σ_{k=1}^r w_k = 1 and ψ_k(x) ≥ 0 for all x. Consider a reweighted representation Σ_{k=1}^r w'_k ψ'_k(x), where ψ'_k(x) = (w_k / w'_k) ψ_k(x), each w'_k ≥ 0, and Σ_k w'_k = 1. Then max_k max_x ψ'_k(x) is minimized when w'_k ∝ w_k max_x ψ_k(x).

Proof. Let a_k = w_k max_x ψ_k(x), and let v = max_k max_x ψ'_k(x) = max_k a_k / w'_k. For any choice of w', we have v ≥ Σ_k a_k / Σ_k w'_k = Σ_k a_k. The first inequality holds because v w'_k ≥ a_k for any k by the definition of v. Since Σ_k a_k is a constant lower bound for v, and this is clearly achieved by setting w'_k ∝ a_k, we have the claimed result.

Note that with ψ_k(x) represented as a rank-1 tensor, the maximum value over x can be computed quickly with the help of the factored structure.

Reweighting the rank-1 tensors as described in Proposition 1 and then sampling gives a form of importance sampling. Importance sampling is often formulated instead with the objective of minimizing the variance of the estimates. Since we expect lowering the variance of intermediate estimates to lead to lower variance on the final marginal estimates, we also examine the following alternative reweighting scheme.

Proposition 2. Let φ(x) = Σ_{k=1}^r w_k ψ_k(x) where w_k ≥ 0, Σ_{k=1}^r w_k = 1. Consider a reweighted representation Σ_{k=1}^r w'_k ψ'_k(x), where ψ'_k(x) = (w_k / w'_k) ψ_k(x), each w'_k ≥ 0, and Σ_k w'_k = 1. Let φ̃(x) = (1/K) Σ_{i=1}^K ψ'_{K_i}(x) be an estimator for φ(x), where each K_i is drawn independently from the categorical distribution with parameters {w'_1, ..., w'_r}. Then φ̃(x) is unbiased, and the total variance Σ_x Var[φ̃(x)] is minimized when w'_k ∝ w_k \sqrt{\sum_x \psi_k(x)^2}.

Proof. Clearly φ̃(x) is an unbiased estimator for φ(x). The variance of this estimator is

    \mathrm{Var}[\tilde{\phi}(x)] = \frac{1}{K} \mathrm{Var}[\psi'_{K_i}(x)]
      = \frac{1}{K} \left( \mathbb{E}[\psi'_{K_i}(x)^2] - \mathbb{E}[\psi'_{K_i}(x)]^2 \right)
      = \frac{1}{K} \left( \sum_k w'_k\, \psi'_k(x)^2 - \phi(x)^2 \right)
      = \frac{1}{K} \left( \sum_k w'_k \left(\frac{w_k}{w'_k}\right)^2 \psi_k(x)^2 - \phi(x)^2 \right)
      = \frac{1}{K} \left( \sum_k \frac{w_k^2}{w'_k}\, \psi_k(x)^2 - \phi(x)^2 \right).

Since K and φ(x) are constant, we have

    \{w'_k\}^* = \arg\min_{\{w'_k\}} \sum_x \mathrm{Var}[\tilde{\phi}(x)] = \arg\min_{\{w'_k\}} \sum_k \frac{1}{w'_k} \sum_x (w_k \psi_k(x))^2,

where each w'_k ≥ 0 and Σ_k w'_k = 1. A straightforward application of the method of Lagrange multipliers yields

    w'_k = \frac{w_k \sqrt{\sum_x \psi_k(x)^2}}{\sum_l w_l \sqrt{\sum_x \psi_l(x)^2}}.

The factored forms of rank-1 tensors again allow the importance reweighting to be computed efficiently.

Propositions 1 and 2 could be applied directly to the algorithm by multiplying two potential functions fully before sampling. If each potential function has K terms, the multiplication would result in K² terms, which is computationally expensive. We therefore only partially exploit the results of Propositions 1 and 2 in our algorithm. When multiplying two potential functions, we draw K pairs of rank-1 tensors from their respective distributions and multiply them as described earlier, but re-weight the resulting mixture using either Proposition 1 or 2. We call the former max-norm reweighting, and the latter min-variance reweighting.
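The two reweighting rules can be written down directly. The sketch below uses dense rank-1 components purely to keep the check small; with the factored form, max_x ψ_k(x) is the product of per-variable maxima and Σ_x ψ_k(x)² is the product of per-variable sums of squares, as noted in the text. Function and argument names are our own.

import numpy as np

def reweight(weights, components, scheme="max"):
    # Return (w', psi') with the represented function unchanged:
    #   scheme="max": w'_k proportional to w_k * max_x psi_k(x)         (Proposition 1)
    #   scheme="var": w'_k proportional to w_k * sqrt(sum_x psi_k(x)^2) (Proposition 2)
    if scheme == "max":
        scores = np.array([w * psi.max() for w, psi in zip(weights, components)])
    else:
        scores = np.array([w * np.sqrt((psi ** 2).sum()) for w, psi in zip(weights, components)])
    new_w = scores / scores.sum()
    new_components = [(w / nw) * psi for w, nw, psi in zip(weights, new_w, components)]
    return new_w, new_components

# sanity check: reweighting leaves the represented mixture unchanged
rng = np.random.default_rng(0)
comps = [np.outer(rng.random(2) + 0.1, rng.random(2) + 0.1) for _ in range(3)]
w = np.array([0.2, 0.5, 0.3])
w2, comps2 = reweight(w, comps, scheme="max")
assert np.allclose(sum(a * c for a, c in zip(w, comps)),
                   sum(a * c for a, c in zip(w2, comps2)))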
5. Experiments

We present experiments on grid-structured Ising models, random graphs with pairwise Ising potentials, and two real-world datasets from the UAI 2014 Inference Competition (Gogate, 2014). We test our algorithm against commonly used approximate inference methods, namely loopy belief propagation (labelled BP), mean-field (MF), tree-reweighted BP (TRW) and Gibbs sampling, using existing implementations from the libDAI package (Mooij, 2010). TBP was implemented in C++ inside the libDAI framework using Eigen (Guennebaud et al., 2010), and all tests were executed on a single core of a 1.4 GHz Intel Core i5 processor. In each experiment, we time the execution of TBP and allow Gibbs sampling the same wall-clock time. We run the other algorithms until convergence, which occurs quickly in each case (hence only the final performance is shown). Parameters used for BP, MF, TRW and Gibbs are given in the supplementary material. To build the junction tree, we use the min-fill heuristic² implemented in libDAI.

² Min-fill repeatedly eliminates variables that result in the lowest number of additional edges in the graph (Koller & Friedman, 2009).

5.1. Grid-structured Ising Models

Figure 1 gives results for 10×10 Ising models. The N×N Ising model is a planar grid-structured MRF described by the joint distribution

    p(x_1, \ldots, x_{N^2}) = \frac{1}{Z} \exp\!\left( \sum_{(i,j)} w_{ij} x_i x_j + \sum_i b_i x_i \right),

where (i, j) runs over all edges in the grid. Each variable X_i takes value either 1 or −1. In our experiments, we choose the w_{ij} uniformly from [−2, 2] (mixed interactions) or [0, 2] (attractive interactions), and the b_i uniformly from [−1, 1]. We use a symmetric rank-2 tensor decomposition for the pairwise potentials.
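The particular symmetric rank-2 decomposition used for the pairwise potentials is described in the supplementary material; as a stand-in for experimentation, any non-negative rank-2 decomposition of the 2×2 table exp(w_ij x_i x_j) will do. The sketch below simply splits the table into its two rows, which is not necessarily the authors' construction.

import numpy as np

def pairwise_ising_rank2(w):
    # One (illustrative, not necessarily the paper's) non-negative rank-2
    # decomposition of phi(x_i, x_j) = exp(w * x_i * x_j), x in {-1, +1}:
    # write the 2x2 table as 0.5 * (2 e_1) (x) row_1 + 0.5 * (2 e_2) (x) row_2.
    table = np.exp(w * np.outer([-1.0, 1.0], [-1.0, 1.0]))
    weights = np.array([0.5, 0.5])
    comps = [[np.array([2.0, 0.0]), table[0]],
             [np.array([0.0, 2.0]), table[1]]]
    return weights, comps

weights, comps = pairwise_ising_rank2(0.8)
dense = sum(wk * np.outer(a, b) for wk, (a, b) in zip(weights, comps))
assert np.allclose(dense, np.exp(0.8 * np.outer([-1.0, 1.0], [-1.0, 1.0])))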

[Figure 1 appears here: two panels, "Attractive interactions" and "Mixed interactions"; y-axis: marginal error; x-axis: sample size K (TBP) / matched running time (Gibbs); curves: TBP (NONE), TBP (MAX), TBP (VAR), BP, MF, TRW, Gibbs.]

Figure 1: Marginal error of TBP for various sample sizes K on 10 × 10 Ising models compared with other approximate inference methods (NONE = no reweighting, MAX = max-norm reweighting, VAR = min-variance reweighting).

We measure performance by the mean L1 error of marginal estimates over the N² variables,

    \frac{1}{N^2} \sum_{i=1}^{N^2} \left| P_{\mathrm{exact}}(X_i = 1) - P_{\mathrm{approx}}(X_i = 1) \right|.

Exact marginals were obtained using the exact junction tree algorithm (possible because the grid size is small). Results are all averages over 100 model instances generated randomly with the specified parameters; error bars show standard error. We give a description of the tensor decomposition and additional results for a range of grid sizes N and interaction strengths w_{ij} in the supplementary material.

As we increase the number of samples used for multiplication K, and hence running time, the performance of TBP improves as expected. On both attractive and mixed interaction models, loopy BP, mean-field and tree-reweighted BP perform poorly. Gibbs sampling performs poorly on models with attractive interactions but performs better on models with mixed interactions. This is expected since models with mixed interactions mix faster. Reweighting gives a noticeable improvement over not using reweighting; both max-norm reweighting and min-variance reweighting give very similar results (in Figure 1, they are not easily differentiated). For the remainder of our experiments, we show results for max-norm reweighting only.
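The error measure used in this subsection is simply the average absolute deviation of the estimated P(X_i = 1) values from the exact ones; a minimal sketch (the array layout is an assumption):

import numpy as np

def mean_l1_marginal_error(p_exact, p_approx):
    # p_exact, p_approx: length-N^2 arrays of P(X_i = 1), one entry per grid
    # variable, from the exact junction tree and from the approximate method.
    return float(np.mean(np.abs(np.asarray(p_exact) - np.asarray(p_approx))))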

5.2. Random Pairwise MRFs

We also test TBP on random binary N-node MRFs with pairwise Ising potentials. We construct the graphs by independently adding each edge (i, j) with probability 0.5, with the requirement that the resulting graph is connected. Pairwise potentials are of the same form as Ising models (i.e. exp[w_{ij} x_i x_j]), where interaction strengths w_{ij} are chosen uniformly from [0, 2] (attractive) or [−2, 2] (mixed) and the b_i are chosen uniformly from [−1, 1].

Figure 2a shows the performance of TBP with increasing sample size K on 15-node models. Similar to grid-structured Ising models, TBP performs well as the sample size is increased. Notably, TBP performs very well on attractive models where other methods perform poorly.

Figure 2b shows the performance of the algorithms as the graph size is increased to 30 nodes. Interestingly, TBP starts to fail for mixed interaction models as the graph size is increased, but remains very effective for attractive interaction models while other methods perform poorly.

5.3. UAI Competition Problems

Finally, we show results of TBP on two real-world datasets from the UAI 2014 Inference Competition, namely the Promedus and Linkage datasets. We chose these two problems following Zhu & Ermon (2015). To compute the initial tensor rank decompositions, we use the non-negative cp_nmu method from Bader et al. (2015), an iterative optimisation method based on (Lee & Seung, 2001). The initial potential functions are decomposed into mixtures with r components. We show results for r = 2 and r = 4. We measure performance by the mean L1 error over all states and all variables,

    \frac{1}{N} \sum_{i=1}^{N} \frac{1}{S_i} \sum_{x_i} \left| P_{\mathrm{exact}}(X_i = x_i) - P_{\mathrm{approx}}(X_i = x_i) \right|,

where S_i is the number of states of variable X_i. Results are averages over all problem instances in (Gogate, 2014).

[Figure 2 appears here: (a) effect of sample size K for N = 15 nodes; (b) effect of graph size N; panels for attractive and mixed interactions; y-axis: marginal error; curves: TBP (MAX) / TBP (K = 100000), BP, MF, TRW, Gibbs.]

Figure 2: Performance of TBP on random pairwise MRFs compared with other approximate inference algorithms.

[Figure 3 appears here: panels "Linkage" and "Promedus"; curves: Gibbs, TBP (r = 2), TBP (r = 4); y-axis: marginal error; x-axis: sample size K (TBP) / matched running time (Gibbs).]

Figure 3: Marginal error of TBP vs Gibbs sampling on UAI 2014 Inference Competition problems for various sample sizes K.

We see that TBP outperforms Gibbs sampling on both problems.³ On the Linkage dataset, using r = 2 mixture components for the initial decompositions performs best, and increasing r does not improve performance. On the Promedus dataset r = 4 performs better; in this case, increasing r may improve the accuracy of the decomposition of the initial potential functions. For reference, the marginal error achieved by the random projection method of Zhu & Ermon (2015) is 0.09 ± 0.02 for Linkage and 0.21 ± 0.06 for Promedus. TBP with K = 10⁵ achieved error of 0.08 ± 0.01 for Linkage and 0.12 ± 0.01 for Promedus, despite using substantially less running time.⁴

³ We omit results for BP, MF and TRW for these problems because the libDAI package was unable to run them.

⁴ TBP with K = 10⁵ takes an average across all problem instances of approximately 10 minutes (Linkage) and 3 minutes (Promedus) per problem on our machine, and does not differ substantially for r = 2 vs r = 4. The results described by Zhu & Ermon (2015) were comparable in running time to 10 million iterations of Gibbs sampling, which we estimate would take several hours on our machine without parallelism.

6. Conclusion and Future Work

We proposed a new sampling-based approximate inference algorithm, tensor belief propagation, which uses a mixture-of-rank-1-tensors representation to approximate cluster potentials and messages in the junction tree algorithm. It gives consistent estimators, and performs well on a range of problems against well-known algorithms.

In this paper, we have not addressed how to perform the initial tensor decomposition. There exist well-known optimisation algorithms for this problem that minimize Euclidean error (Kolda & Bader, 2009), though it is not clear that this is the best objective for the purpose of TBP. We are currently investigating the best way of formulating and solving this optimisation problem to yield accurate results during inference. Further, when constructing the junction tree, it is not clear whether min-fill and other well-known heuristics remain appropriate for TBP. In particular, these cost functions aim to minimise the induced width of the junction tree, which is no longer directly relevant to the performance of the algorithm. Further work may investigate other heuristics, for example with the goal of minimising cluster variance.

Finally, approximate inference algorithms have applications in learning (for example, in the gradient computations when training restricted Boltzmann machines), and thus TBP may improve learning as well as inference in some applications.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. This work is supported by NUS AcRF Tier 1 grant R-252-000-639-114 and a QUT Vice-Chancellor's Research Fellowship Grant.

References

Andrieu, Christophe, De Freitas, Nando, Doucet, Arnaud, and Jordan, Michael I. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

Bader, Brett W., Kolda, Tamara G., et al. MATLAB Tensor Toolbox Version 2.6. http://www.sandia.gov/~tgkolda/TensorToolbox/, February 2015.

Chavira, Mark and Darwiche, Adnan. Compiling Bayesian networks with local structure. In IJCAI, 2005.

Darwiche, Adnan. A differential approach to inference in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), 2000.

Gogate, Vibhav. UAI 2014 Inference Competition. http://www.hlt.utdallas.edu/~vgogate/uai14-competition/index.html, 2014.

Gogate, Vibhav and Domingos, Pedro. Structured message passing. In Uncertainty in Artificial Intelligence (UAI), 2013.

Guennebaud, Gaël, Jacob, Benoît, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

Ihler, Alexander T and McAllester, David A. Particle belief propagation. In AISTATS, pp. 256–263, 2009.

Jensen, Finn Verner, Olesen, Kristian G, and Andersen, Stig Kjaer. An algebra of Bayesian belief universes for knowledge-based systems. Networks, 20(5):637–659, 1990.

Kolda, Tamara G and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. MIT Press, 2009.
Koller, Daphne, Lerner, Uri, and Angelov, Dragomir. A general algorithm for approximate inference and its application to hybrid Bayes nets. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 324–333. Morgan Kaufmann Publishers Inc., 1999.

Lauritzen, Steffen L and Spiegelhalter, David J. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pp. 157–224, 1988.

Lee, Daniel D and Seung, H Sebastian. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (NIPS), pp. 556–562, 2001.

McAuley, Julian J. and Caetano, Tibério S. Faster algorithms for max-product message-passing. Journal of Machine Learning Research, 12:1349–1388, 2011.

Minka, Thomas P. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc., 2001.

Mooij, Joris M. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010.

Park, James D. and Darwiche, Adnan. A differential semantics for jointree algorithms. In Advances in Neural Information Processing Systems (NIPS), 2002.

Pearl, Judea. Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI, pp. 133–136, 1982.

Shafer, Glenn R and Shenoy, Prakash P. Probability propagation. Annals of Mathematics and Artificial Intelligence, 2(1-4):327–351, 1990.

Sudderth, Erik B, Ihler, Alexander T, Isard, Michael, Freeman, William T, and Willsky, Alan S. Nonparametric belief propagation. Communications of the ACM, 53(10):95–103, 2010.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Xue, Yexiang, Ermon, Stefano, Lebras, Ronan, Gomes, Carla P, and Selman, Bart. Variable elimination in Fourier domain. In Proc. 33rd International Conference on Machine Learning, 2016.

Zhu, Michael and Ermon, Stefano. A hybrid approach for probabilistic inference using random projections. In ICML, pp. 2039–2047, 2015.