Tensor Belief Propagation

Andrew Wrigley¹   Wee Sun Lee²   Nan Ye³

¹Australian National University, Canberra, Australia. ²National University of Singapore, Singapore. ³Queensland University of Technology, Brisbane, Australia. Correspondence to: Andrew Wrigley, Wee Sun Lee, Nan Ye.

Proceedings of the 34th International Conference on Machine Learning, Sydney, Australia, PMLR 70, 2017. Copyright 2017 by the author(s).

Abstract

We propose a new approximate inference algorithm for graphical models, tensor belief propagation, based on approximating the messages passed in the junction tree algorithm. Our algorithm represents the potential functions of the graphical model and all messages on the junction tree compactly as mixtures of rank-1 tensors. Using this representation, we show how to perform the operations required for inference on the junction tree efficiently: marginalisation can be computed quickly due to the factored form of rank-1 tensors, while multiplication can be approximated using sampling. Our analysis gives sufficient conditions for the algorithm to perform well, including for the case of high-treewidth graphs, for which exact inference is intractable. We compare our algorithm experimentally with several approximate inference algorithms and show that it performs well.

1. Introduction

Probabilistic graphical models provide a general framework to conveniently build probabilistic models in a modular and compact way. They are commonly used in statistics, computer vision, natural language processing, machine learning and many related fields (Wainwright & Jordan, 2008). The success of graphical models depends largely on the availability of efficient inference algorithms. Unfortunately, exact inference is intractable in general, making approximate inference an important research topic.

Approximate inference algorithms generally adopt a variational approach or a sampling approach. The variational approach formulates the inference problem as an optimisation problem and constructs approximations by solving relaxations of the optimisation problem. A number of well-known inference algorithms can be seen as variational algorithms, such as loopy belief propagation, mean-field variational inference, and generalized belief propagation (Wainwright & Jordan, 2008). The sampling approach uses sampling to approximate either the underlying distribution or key quantities of interest. Commonly used sampling methods include particle filters and Markov-chain Monte Carlo (MCMC) algorithms (Andrieu et al., 2003).

Our proposed algorithm, tensor belief propagation (TBP), can be seen as a sampling-based algorithm. Unlike particle filters or MCMC methods, which sample states (also known as particles), TBP samples functions in the form of rank-1 tensors. Specifically, we use a data structure commonly used in exact inference, the junction tree, and perform approximate message passing on the junction tree using messages represented as mixtures of rank-1 tensors. We assume that each factor in the graphical model is originally represented as a tensor decomposition (mixture of rank-1 tensors). Under this assumption, all messages and intermediate factors also have the same representation. Our key observation is that marginalisation can be performed efficiently for mixtures of rank-1 tensors, and multiplication can be approximated by sampling. This leads to an approximate message passing algorithm where messages and intermediate factors are approximated by low-rank tensors.

We provide analysis, giving conditions under which the method performs well. We compare TBP experimentally with several existing approximate inference methods using Ising models, random MRFs and two real-world datasets, demonstrating promising results.
2. Related Work

Exact inference on tree-structured graphical models can be performed efficiently using belief propagation (Pearl, 1982), a dynamic programming algorithm that involves passing messages between nodes containing the results of intermediate computations. For arbitrary graphical models, the well-known junction tree algorithm (Shafer & Shenoy, 1990; Lauritzen & Spiegelhalter, 1988; Jensen et al., 1990) is commonly used. The model is first compiled into a junction tree data structure and a similar message passing algorithm is then run over the junction tree. Unfortunately, for non-tree models the time and space complexity of the junction tree algorithm grows exponentially with a property of the graph called its treewidth. For high-treewidth graphical models, exact inference is intractable in general.

Our work approximates the messages passed in the junction tree algorithm to avoid the exponential runtime caused by exact computations in high-treewidth models. Various previous work has taken the same approach. Expectation propagation (EP) (Minka, 2001) approximates messages by minimizing the Kullback-Leibler (KL) divergence between the actual message and its approximation. Structured message passing (Gogate & Domingos, 2013) can be considered as a special case of EP where structured representations, in particular algebraic decision diagrams (ADDs) and sparse hash tables, are used so that EP can be performed efficiently. In contrast to ADDs, the tensor decompositions used for TBP may provide a more compact representation for some problems. An ADD partitions a tensor into axis-aligned hyper-rectangles; it is possible to represent a hyper-rectangle using a rank-1 tensor, but a rank-1 tensor is generally not representable as an axis-aligned rectangle. Furthermore, the supports of the rank-1 tensors in the mixture may overlap. However, an ADD compresses hyper-rectangles that share sub-structures, and this may result in the two methods having different strengths. Sparse tables, on the other hand, work well for cases with extreme sparsity.
Several methods use sampled particles to approximate messages (Koller et al., 1999; Ihler & McAllester, 2009; Sudderth et al., 2010). To allow their algorithms to work well on problems with less sparsity, Koller et al. (1999) and Sudderth et al. (2010) use non-parametric methods to smooth the particle representation of messages. In contrast, we decompose each tensor into a mixture of rank-1 tensors and sample the rank-1 tensors directly, instead of through the intermediate step of sampling particles. Another approach, which pre-samples the particles at each node and passes messages between these pre-sampled particles, was taken by Ihler & McAllester (2009). The methods of Ihler & McAllester (2009) and Sudderth et al. (2010) were also applied on graphs with loops using loopy belief propagation.

Xue et al. (2016) use discrete Fourier representations for inference via the elimination algorithm. The discrete Fourier representation is a special type of tensor decomposition. Instead of sampling, the authors perform approximations by truncating the Fourier coefficients, giving different approximation properties. Other related works include (Darwiche, 2000; Park & Darwiche, 2002; Chavira & Darwiche, 2005), where belief networks are compiled into compact arithmetic circuits (ACs). On the related problem of MAP inference, McAuley & Caetano (2011) show that junction tree clusters that factor over subcliques or consist only of latent variables yield improved complexity properties.

3. Preliminaries

For simplicity we limit our discussion to Markov random fields (MRFs), but our results apply equally to Bayesian networks and general factor graphs. We focus only on discrete models. A Markov random field G is an undirected graph representing a probability distribution P(X_1, ..., X_N), such that P factorises over the max-cliques in G, i.e.

    P(X_1, \ldots, X_N) = \frac{1}{Z} \prod_{c \in cl(G)} \phi_c(X_c),    (1)

where cl(G) is the set of max-cliques in G and Z = \sum_X \prod_{c \in cl(G)} \phi_c(X_c) ensures normalisation. We call the factors φ_c clique potentials or potentials.

TBP is based on the junction tree algorithm (see e.g. Koller & Friedman (2009)). A junction tree is a special type of cluster graph, i.e. an undirected graph with nodes called clusters that are associated with sets of variables rather than single variables. Specifically, a junction tree is a cluster graph that is a tree and which also satisfies the running intersection property. The running intersection property states that if a variable is in two clusters, it must also be in every cluster on the path that connects the two clusters.

The junction tree algorithm is essentially the well-known belief propagation algorithm applied to the junction tree after the cluster potentials have been initialized. At initialisation, each clique potential is first associated with a cluster. Each cluster potential Φ_t(X_t) is computed by multiplying all the clique potentials φ_c(X_c) associated with the cluster X_t. Thereafter, the algorithm is defined recursively in terms of messages passed between neighbouring clusters. A message is always a function of the variables in the receiving cluster, and represents an intermediate marginalisation over a partial set of factors. The message m_{t→s}(X_s) sent from a cluster t to a neighbouring cluster s is defined recursively by

    m_{t \to s}(X_s) = \sum_{X_t \setminus X_s} \Phi_t(X_t) \prod_{u \in N(t) \setminus \{s\}} m_{u \to t}(X_t),    (2)

where N(t) is the set of neighbours of t. Since the junction tree is singly connected, this recursion is well-defined. After all messages have been computed, the marginal distribution on a cluster of variables X_s is computed using

    P_s(X_s) \propto \Phi_s(X_s) \prod_{t \in N(s)} m_{t \to s}(X_s),    (3)

and univariate marginals can be computed by summation over cluster marginals. The space and time complexity of the junction tree inference algorithm is exponential in the induced width of the graph, i.e. the number of variables in the largest tree cluster minus 1 (Koller & Friedman, 2009). The lowest possible induced width (over all possible junction trees for the graph) is defined as the treewidth of the graph.¹

¹ Finding the treewidth of a graph is NP-hard; in practice we use various heuristics to construct the junction tree.
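To ground the updates (2) and (3), the following is a minimal sketch of one exact message computation for clusters stored as dense tables. It is ours for illustration only (the paper's implementation is in C++ inside libDAI); the data layout and helper names are assumptions.

import numpy as np

def send_message(cluster_vars_t, phi_t, incoming, sep_vars):
    # Compute m_{t->s}(X_s) as in Eq. (2) for dense table potentials.
    #   cluster_vars_t: tuple of variable ids in cluster t (one tensor axis each)
    #   phi_t:          dense cluster potential Phi_t over cluster_vars_t
    #   incoming:       list of (vars_u, table) messages from neighbours u != s,
    #                   each defined over a subset of cluster_vars_t
    #   sep_vars:       variables shared with the receiving cluster s
    prod = phi_t.copy()
    for vars_u, m in incoming:
        # align the message's axes with the cluster's axis order, then broadcast
        order = sorted(range(len(vars_u)),
                       key=lambda i: cluster_vars_t.index(vars_u[i]))
        shape = [phi_t.shape[cluster_vars_t.index(v)] if v in vars_u else 1
                 for v in cluster_vars_t]
        prod = prod * np.transpose(m, order).reshape(shape)
    # marginalise out the variables not shared with the receiving cluster
    sum_axes = tuple(i for i, v in enumerate(cluster_vars_t) if v not in sep_vars)
    return prod.sum(axis=sum_axes)

# toy usage: cluster t over variables (0, 1, 2) sends to a neighbour over (1, 2)
phi_t = np.random.rand(2, 2, 2)
incoming = [((0,), np.random.rand(2))]   # one earlier message, over variable 0
m_ts = send_message((0, 1, 2), phi_t, incoming, sep_vars=(1, 2))
print(m_ts.shape)                        # (2, 2)

This dense computation is exactly what becomes intractable for large clusters; TBP, described next, replaces the tables by mixtures of rank-1 tensors.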

4. Tensor Belief Propagation

The TBP algorithm we propose (Algorithm 1) is the same as the junction tree algorithm except for the approximations at line 1 and line 4, which we describe below.

Algorithm 1 Tensor Belief Propagation
  input: Clique potentials {φ_c(X_c)}, junction tree J.
  output: Approximate cluster marginals {P̃_s(X_s)} for all s.
  1: For each cluster X_t, compute Φ̃_t(X_t) ≈ ∏_c φ_c(X_c), where the product is over all cliques c associated with t.
  2: while there is any unsent message do
  3:   Pick an unsent message m_{t→s}(X_s) with all messages to t from neighbours other than s sent.
  4:   Send m̃_{t→s}(X_s) ≈ Σ_{X_t∖X_s} Φ̃_t(X_t) ∏_{u∈N(t)∖{s}} m̃_{u→t}(X_t).
  5: end while
  6: return P̃_s(X_s) ∝ Φ̃_s(X_s) ∏_{t∈N(s)} m̃_{t→s}(X_s).

There are two challenges in applying the junction tree algorithm to high-treewidth graphical models: representing the intermediate potential functions, and computing them. For representation, using tables to represent the cluster potentials and messages requires space exponential in the induced width. For computation, the two operations relevant to the complexity of the algorithm are marginalisation over a subset of variables in a cluster and multiplication of multiple factors. When clusters become large, these operations become intractable unless additional structure can be exploited.

TBP alleviates these difficulties by representing all potential functions as mixtures of rank-1 tensors. We show how to perform exact marginalisation of a mixture (required in line 4) efficiently, and how to perform approximate multiplication of mixtures using sampling (used for approximation in lines 1 and 4).

4.1. Mixture of Rank-1 Tensors

As we are concerned with discrete distributions, each potential φ_c can be represented by a multidimensional array, i.e. a tensor. Furthermore, a d-dimensional tensor T ∈ R^{N_1×···×N_d} can be decomposed into a weighted sum of outer products of vectors as

    T = \sum_{k=1}^{r} w_k\, a_k^1 \otimes a_k^2 \otimes \cdots \otimes a_k^d,    (4)

where w_k ∈ R, a_k^i ∈ R^{N_i} and ⊗ is the outer product, i.e. (a_k^1 ⊗ a_k^2 ⊗ ··· ⊗ a_k^d)_{i_1,...,i_d} = a^1_{k,i_1} ··· a^d_{k,i_d}. We denote the vector of weights {w_k} as w. The smallest r for which an exact r-term decomposition exists is called the rank of T, and a decomposition (4) using this r is a tensor rank decomposition. This decomposition is known by several names including CANDECOMP/PARAFAC (CP) decomposition and Hitchcock decomposition (Kolda & Bader, 2009). In this paper, we assume without loss of generality that the weights are non-negative and sum to 1, giving a mixture of rank-1 tensors. Such a mixture forms a probability distribution over rank-1 tensors, which we refer to as a tensor belief. We also assume that the decomposition is non-negative, i.e. the rank-1 tensors are non-negative, although the method can be extended to allow negative values.

For a (clique or cluster) potential function φ_s(X_s) over |X_s| = N_s variables, (4) is equivalent to decomposing φ_s into a sum of fully-factorised terms, i.e.

    \phi_s(X_s) = \sum_{k=1}^{r} w_k^s\, \psi_k^s(X_s) = \sum_{k=1}^{r} w_k^s\, \psi_{k,1}^s(X_{s_1}) \cdots \psi_{k,N_s}^s(X_{s_{N_s}}).    (5)

This observation allows us to perform marginalisation and multiplication operations efficiently.
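As a concrete illustration of the representation in (5), the sketch below stores a potential as a weight vector plus, per mixture component, one non-negative vector per variable. Expanding back to a dense table is only for sanity checks on small examples; the container and field names are ours, not the paper's implementation.

import numpy as np
from dataclasses import dataclass

@dataclass
class Rank1Mixture:
    # phi(x) = sum_k weights[k] * prod_j factors[k][j][x_j]   (cf. Eq. 5)
    weights: np.ndarray   # shape (r,), non-negative, summing to 1
    factors: list         # factors[k][j]: 1-D non-negative array for variable j

    def to_dense(self):
        # Expand to a full table; exponential in the number of variables,
        # so only usable for tiny examples.
        dense = 0.0
        for w, fs in zip(self.weights, self.factors):
            term = w
            for f in fs:
                term = np.multiply.outer(term, f)
            dense = dense + term
        return dense

# a rank-2 mixture over two binary variables
phi = Rank1Mixture(
    weights=np.array([0.6, 0.4]),
    factors=[[np.array([1.0, 2.0]), np.array([0.5, 1.5])],
             [np.array([2.0, 1.0]), np.array([1.0, 1.0])]],
)
print(phi.to_dense())   # a 2 x 2 table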

4.2. Marginalisation

Marginalising out a variable X_i from a cluster X_s simplifies in the same manner as if the distribution was fully-factorised, namely

    \sum_{X_i} \phi_s(X_s) = \sum_{k=1}^{r} w_k^s \Big( \sum_{X_i} \psi_{k,j}^s(X_i) \Big) \cdot \psi_{k,1}^s(X_{s_1})\, \psi_{k,2}^s(X_{s_2}) \cdots \psi_{k,N_s}^s(X_{s_{N_s}}) \quad \text{(excluding } \psi_{k,j}^s(X_i)\text{)},    (6)

where we simply push the sum Σ_{X_i} inside and only evaluate it over the univariate factor ψ_{k,j}^s(X_i). We can then absorb this sum into the weights {w_k^s} and the result stays in decomposed form (5). To marginalise over multiple variables, we evaluate (6) for each variable in turn.
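Continuing the running sketch above (and assuming its Rank1Mixture container), marginalising a variable out of a mixture as in (6) only sums that variable's univariate factor in each component and folds the result into the component's weight. Returning the absorbed mass separately so that the new weights still sum to 1 is our own bookkeeping choice.

import numpy as np

def marginalise_out(mix, j):
    # Sum variable j out of a Rank1Mixture (Eq. 6). Returns a new mixture over
    # the remaining variables (in their original relative order) plus the
    # total mass absorbed into the weights.
    new_w, new_factors = [], []
    for w, fs in zip(mix.weights, mix.factors):
        s = fs[j].sum()                               # sum the single univariate factor
        new_w.append(w * s)                           # absorb it into the weight
        new_factors.append([f for i, f in enumerate(fs) if i != j])
    new_w = np.asarray(new_w)
    scale = new_w.sum()
    return Rank1Mixture(weights=new_w / scale, factors=new_factors), scale

marg, scale = marginalise_out(phi, j=1)
print(scale * marg.to_dense())     # equals phi.to_dense().sum(axis=1)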

4.3. Multiplication

The key observation for multiplication is that a mixture of rank-1 tensors can be treated as a probability distribution over the rank-1 tensors with expectation equal to the true function, by considering the weight of each rank-1 term w_k as its probability.

To multiply two potentials φ_i and φ_j, we repeatedly sample rank-1 terms from each multiplicand to build the product. We draw a sample of K pairs of indices {(k_r, l_r)}_{r=1}^K independently from w^i and w^j respectively, and use the approximation

    \phi_i(X_i) \cdot \phi_j(X_j) \approx \frac{1}{K} \sum_{r=1}^{K} \psi_{k_r}^i(X_i) \cdot \psi_{l_r}^j(X_j).    (7)

The approximation is also a mixture of rank-1 tensors, with the rank-1 tensors being the ψ_{k_r}^i(X_i) · ψ_{l_r}^j(X_j), and their weights being the frequencies of the sampled (k_r, l_r) pairs. This process is equivalent to drawing a sample of the same size from the distribution representing φ_i(X_i) · φ_j(X_j) and hence provides an unbiased estimate of the product function. Multiplication of each pair ψ_{k_r}^i(X_i) · ψ_{l_r}^j(X_j) can be performed efficiently due to their factored forms. It is possible to extend the method to allow multiplication of more potential functions simultaneously, but we only use pairwise multiplication in this paper, repeating the process as necessary to multiply more functions.
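The sampled product (7) can be sketched as follows. Here each rank-1 component is stored as a dict from variable id to its univariate factor (a different bookkeeping from the sketch above, since the two multiplicands may be over different variable sets); shared variables multiply their factors elementwise, and the output weights are the pair frequencies, as in the text. This is an illustration of the idea, not the paper's code.

import numpy as np

def sampled_product(w_a, comps_a, w_b, comps_b, K, seed=0):
    # Approximate phi_a * phi_b by K sampled pairs of rank-1 terms (Eq. 7).
    # comps_a[k], comps_b[l]: dicts {variable id: 1-D non-negative factor}.
    rng = np.random.default_rng(seed)
    ks = rng.choice(len(w_a), size=K, p=w_a)
    ls = rng.choice(len(w_b), size=K, p=w_b)
    counts, comps = {}, {}
    for k, l in zip(ks, ls):
        if (k, l) not in counts:
            prod = dict(comps_a[k])                        # copy the first term's factors
            for v, f in comps_b[l].items():
                prod[v] = prod[v] * f if v in prod else f  # shared variable: multiply
            comps[(k, l)] = prod
            counts[(k, l)] = 0
        counts[(k, l)] += 1
    pairs = list(comps)
    weights = np.array([counts[p] for p in pairs], dtype=float) / K
    return weights, [comps[p] for p in pairs]

# toy usage: two rank-2 potentials over variables {0, 1} and {1, 2}
w_a, w_b = np.array([0.7, 0.3]), np.array([0.5, 0.5])
comps_a = [{0: np.array([1.0, 2.0]), 1: np.array([1.0, 1.0])},
           {0: np.array([2.0, 1.0]), 1: np.array([0.5, 1.0])}]
comps_b = [{1: np.array([1.0, 3.0]), 2: np.array([1.0, 2.0])},
           {1: np.array([2.0, 1.0]), 2: np.array([1.0, 1.0])}]
weights, comps = sampled_product(w_a, comps_a, w_b, comps_b, K=1000)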

4.4. Theoretical Analysis

For simplicity, we give results for binary MRFs. Extension to non-binary variables is straightforward. We assume that exact tensor decompositions are given for all initial clique potentials.

Theorem 1. Consider a binary MRF with C max-cliques with potential functions represented as non-negative tensor decompositions. Consider the unnormalised distribution D(X) = ∏_{c∈cl(G)} φ_c(X_c). Let D_i(X_i) = Σ_{X∖X_i} D(X) be the unnormalised marginal for variable X_i. Assume that the values of the rank-1 tensors in any mixture resulting from pairwise multiplication are upper bounded by M, and that the approximation target for any cell in any multiplication operation is lower bounded by B. Let D̃_i(X_i) be the estimates produced by TBP using a junction tree with induced width T. With probability at least 1 − δ, for all i and x_i,

    (1 - \epsilon) D_i(x_i) \le \tilde{D}_i(x_i) \le (1 + \epsilon) D_i(x_i)

if the sample size used for all multiplication operations is at least

    K_{\min}(\epsilon, \delta) \in O\!\left( \frac{C^2}{\epsilon^2} \frac{M}{B} \left( \log C + T + \log \frac{2}{\delta} \right) \right).

Furthermore, D̃_i(X_i) remains a consistent estimator for D_i(X_i) if B = 0.
Proof. We give the proof for the case B ≠ 0 here. The consistency proof for the case B = 0 can be found in the supplementary material.

The errors in the unnormalised marginals are due to the errors in approximate pairwise multiplication defined by Equation (7). We first give bounds for the errors in all the pairwise multiplication by sampling operations, then derive the bounds for the errors in the unnormalised marginals.

Recall that by the multiplicative Chernoff bound (see e.g. (Koller & Friedman, 2009)), for K i.i.d. random variables Y_1, ..., Y_K in range [0, M] with expected value µ, we have

    P\!\left( \frac{1}{K} \sum_{i=1}^{K} Y_i \notin [(1-\zeta)\mu, (1+\zeta)\mu] \right) \le 2 \exp\!\left( -\frac{\zeta^2 \mu K}{3M} \right).

The number of pairwise multiplications required to initialize the cluster potentials {Φ̃_t(X_t)} is at most C (line 1). The junction tree has at most C clusters, and thus at most 2(C − 1) messages need to be computed. Each message requires at most C pairwise multiplications (line 4). Hence the total number of pairwise multiplications needed is at most 2C². Each pairwise multiplication involves functions defined on at most T binary variables, and thus it simultaneously estimates at most 2^T values. Hence at most 2^{T+1} C² values are estimated by sampling in the algorithm.

Since each estimate satisfies the Chernoff bound, we can apply a union bound to bound the probability that the estimate for any µ_j is outside [(1 − ζ)µ_j, (1 + ζ)µ_j], giving

    P\!\left( \frac{1}{K} \sum_{i=1}^{K} X_i \notin [(1-\zeta)\mu_j, (1+\zeta)\mu_j] \text{ for any estimate } j \right) \le 2^{T+2} C^2 \exp\!\left( -\frac{\zeta^2 B K}{3M} \right),

where B is a lower bound on the minimum value of all cells estimated during the algorithm. If we set an upper bound of δ on this error probability and rearrange for K, we have that with probability at least 1 − δ, all estimates are within a factor of (1 ± ζ) of their true values when

    K \ge \frac{3}{\zeta^2} \frac{M}{B} \ln \frac{2^{T+2} C^2}{\delta}.    (8)

We now seek an expression for the sample size required for the unnormalised marginals to have small errors. First we argue that at most C + Q − 1 pairwise multiplications are used in the process of constructing the marginals at any node u, where Q is the number of clusters. This is true for the base case of a tree of size 1, as no messages need to be passed. As the inductive hypothesis, assume that the statement is true for trees of size less than n. The node u is considered as the root of the tree, and by the inductive hypothesis the number of multiplications used in each subtree i is at most C_i + Q_i − 1, where C_i and Q_i are the number of clique potentials and clusters associated with subtree i. Summing them up and noting that we need to perform at most one additional multiplication for each clique potential associated with u for initialisation, and one additional multiplication for each subtree, gives us the required result. To simplify analysis, we bound C + Q − 1 by 2C from here onwards.

At worst, each pairwise multiplication results in an extra (1 ± ζ) factor in the bound. Since we are using a multiplicative bound, marginalisation operations have no effect on the bound. As we do no more than 2C multiplications, the final marginal estimates are all within a factor (1 ± ζ)^{2C} of their true value.

To bound the marginal so that it is within a factor (1 ± ε) of its true value for a chosen ε > 0, we note that choosing ζ = ln(1+ε)/(2C) implies (1 − ζ)^{2C} ≥ 1 − ε and (1 + ζ)^{2C} ≤ (1 + ε):

    (1 - \zeta)^{2C} = \left(1 - \frac{\ln(1+\epsilon)}{2C}\right)^{2C} \ge 1 - \ln(1+\epsilon) \ge 1 - \epsilon,    (9)

    (1 + \zeta)^{2C} = \left(1 + \frac{\ln(1+\epsilon)}{2C}\right)^{2C} \le \exp(\ln(1+\epsilon)) = 1 + \epsilon.    (10)

In (9) we use Bernoulli's inequality together with ln(1+ε) ≤ ε, and in (10) we use (1+x)^r ≤ exp(rx). Substituting this ζ into (8), we have that all marginal estimates are accurate within a factor (1 ± ε) with probability at least 1 − δ, when

    K \ge \frac{12 C^2}{(\ln(1+\epsilon))^2} \frac{M}{B} \ln \frac{2^{T+2} C^2}{\delta}.    (11)

Using the fact that ln(1+ε) ≥ ε · ln 2 for 0 ≤ ε ≤ 1, and 24/ln 2 < 35, we can relax this bound to

    K \ge \frac{35 C^2}{\epsilon^2} \frac{M}{B} \left( \log C + T + \log \frac{2}{\delta} \right).

Corollary 1. Under the same conditions as Theorem 1, with probability at least 1 − δ, the normalised marginal estimates p̃_i(x_i) satisfy

    (1 - \gamma)\, p_i(x_i) \le \tilde{p}_i(x_i) \le (1 + \gamma)\, p_i(x_i)

for all i and x_i, if the sample size used for all multiplication operations is at least

    K_{\min}(\gamma, \delta) \in O\!\left( \frac{C^2}{\gamma^2} \frac{M}{B} \left( \log C + T + \log \frac{1}{\delta} \right) \right).

Proof. Suppose the unnormalised marginals have relative error bounded by (1 ± ε), i.e.

    (1 - \epsilon) D_i(x_i) \le \tilde{D}_i(x_i) \le (1 + \epsilon) D_i(x_i)

for all i and x_i. Then we have

    \tilde{p}_i(x_i) = \frac{\tilde{D}_i(x_i)}{\sum_{x_i} \tilde{D}_i(x_i)} \le \frac{(1+\epsilon) D_i(x_i)}{\sum_{x_i} (1-\epsilon) D_i(x_i)} = \frac{1+\epsilon}{1-\epsilon}\, p_i(x_i).

To bound the relative error on p̃_i(x_i) to (1 + γ), we set

    \frac{1+\epsilon}{1-\epsilon} = 1 + \gamma \;\Longrightarrow\; \epsilon = \frac{\gamma}{\gamma + 2}.

Since 1/ε² = ((γ+2)/γ)² < (3/γ)² = O(1/γ²) for γ < 1, the increase in K_min required to bound the normalised estimates rather than the unnormalised estimates is at most a constant factor. Thus,

    K_{\min}(\gamma, \delta) \in O\!\left( \frac{C^2}{\gamma^2} \frac{M}{B} \left( \log C + T + \log \frac{1}{\delta} \right) \right)

as required. The negative side of the bound is similar.
Interestingly, the sample size does not grow exponentially with the induced width, and hence the treewidth of the graph. As the inference problem is NP-hard, we expect the ratio M/B to be large in difficult problems. The M/B ratio comes from bounding the relative error when sampling a mixture φ(x) = Σ_{k=1}^r w_k ψ_k(x). A more refined bound can be obtained by bounding max_k max_x ψ_k(x)/φ(x) instead; this bound would not grow as quickly if ψ_k(x) is always small whenever φ(x) is small. Understanding the properties of function classes where these bounds are small may help us understand when TBP works well.

Theorem 1 suggests that it may be useful to reweight the rank-1 tensors in a multiplicand to give a smaller M/B ratio. The following proposition gives a reweighting scheme that minimises the maximum value of the rank-1 tensors in a multiplicand, which leads to a smaller M with B fixed. Theorem 1 still holds with this reweighting.

Proposition 1. Let φ(x) = Σ_{k=1}^r w_k ψ_k(x) where w_k ≥ 0, Σ_{k=1}^r w_k = 1 and ψ_k(x) ≥ 0 for all x. Consider a reweighted representation Σ_{k=1}^r w'_k ψ'_k(x), where ψ'_k(x) = (w_k / w'_k) ψ_k(x), each w'_k ≥ 0, and Σ_k w'_k = 1. Then max_k max_x ψ'_k(x) is minimized when w'_k ∝ w_k max_x ψ_k(x).

Proof. Let a_k = w_k max_x ψ_k(x), and let v = max_k max_x ψ'_k(x) = max_k a_k / w'_k. For any choice of w', we have v ≥ Σ_k a_k / Σ_k w'_k = Σ_k a_k. The first inequality holds because v w'_k ≥ a_k for any k by the definition of v. Since Σ_k a_k is a constant lower bound for v, and this is clearly achieved by setting w'_k ∝ a_k, we have the claimed result.

Note that with ψ_k(x) represented as a rank-1 tensor, the maximum value over x can be computed quickly with the help of the factored structure.

Reweighting the rank-1 tensors as described in Proposition 1 and then sampling gives a form of importance sampling. Importance sampling is often formulated instead with the objective of minimizing the variance of the estimates. Since we expect lowering the variance of intermediate estimates to lead to lower variance on the final marginal estimates, we also examine the following alternative reweighting scheme.

Proposition 2. Let φ(x) = Σ_{k=1}^r w_k ψ_k(x) where w_k ≥ 0, Σ_{k=1}^r w_k = 1. Consider a reweighted representation Σ_{k=1}^r w'_k ψ'_k(x), where ψ'_k(x) = (w_k / w'_k) ψ_k(x), each w'_k ≥ 0, and Σ_k w'_k = 1. Let φ̃(x) = (1/K) Σ_{i=1}^K ψ'_{K_i}(x) be an estimator for φ(x), where each K_i is drawn independently from the categorical distribution with parameters {w'_1, ..., w'_r}. Then φ̃(x) is unbiased, and the total variance Σ_x Var[φ̃(x)] is minimized when w'_k ∝ w_k \sqrt{\sum_x \psi_k(x)^2}.

Proof. Clearly φ̃(x) is an unbiased estimator for φ(x). The variance of this estimator is

    \mathrm{Var}[\tilde{\phi}(x)] = \frac{1}{K} \mathrm{Var}[\psi'_{K_i}(x)]
      = \frac{1}{K} \left( \mathbb{E}[\psi'_{K_i}(x)^2] - \mathbb{E}[\psi'_{K_i}(x)]^2 \right)
      = \frac{1}{K} \left( \sum_k w'_k\, \psi'_k(x)^2 - \phi(x)^2 \right)
      = \frac{1}{K} \left( \sum_k w'_k \left(\frac{w_k}{w'_k}\right)^2 \psi_k(x)^2 - \phi(x)^2 \right)
      = \frac{1}{K} \left( \sum_k \frac{w_k^2}{w'_k}\, \psi_k(x)^2 - \phi(x)^2 \right).

Since K and φ(x) are constant, we have

    \{w'_k\}^* = \arg\min_{\{w'_k\}} \sum_x \mathrm{Var}[\tilde{\phi}(x)] = \arg\min_{\{w'_k\}} \sum_k \frac{1}{w'_k} \sum_x (w_k \psi_k(x))^2,

where each w'_k ≥ 0 and Σ_k w'_k = 1. A straightforward application of the method of Lagrange multipliers yields

    w'_k = \frac{w_k \sqrt{\sum_x \psi_k(x)^2}}{\sum_l w_l \sqrt{\sum_x \psi_l(x)^2}}.

The factored forms of rank-1 tensors again allow the importance reweighting to be computed efficiently.

Propositions 1 and 2 could be applied directly to the algorithm by multiplying two potential functions fully before sampling. If each potential function has K terms, the multiplication would result in K² terms, which is computationally expensive. We therefore only partially exploit the results of Propositions 1 and 2 in our algorithm. When multiplying two potential functions, we draw K pairs of rank-1 tensors from their respective distributions and multiply them as described earlier, but re-weight the resulting mixture using either Proposition 1 or 2. We call the former max-norm reweighting, and the latter min-variance reweighting.
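The two reweighting rules can be written down directly. The sketch below uses dense rank-1 components purely to keep the check small; with the factored form, max_x ψ_k(x) is the product of per-variable maxima and Σ_x ψ_k(x)² is the product of per-variable sums of squares, as noted in the text. Function and argument names are our own.

import numpy as np

def reweight(weights, components, scheme="max"):
    # Return (w', psi') with the represented function unchanged:
    #   scheme="max": w'_k proportional to w_k * max_x psi_k(x)         (Proposition 1)
    #   scheme="var": w'_k proportional to w_k * sqrt(sum_x psi_k(x)^2) (Proposition 2)
    if scheme == "max":
        scores = np.array([w * psi.max() for w, psi in zip(weights, components)])
    else:
        scores = np.array([w * np.sqrt((psi ** 2).sum()) for w, psi in zip(weights, components)])
    new_w = scores / scores.sum()
    new_components = [(w / nw) * psi for w, nw, psi in zip(weights, new_w, components)]
    return new_w, new_components

# sanity check: reweighting leaves the represented mixture unchanged
rng = np.random.default_rng(0)
comps = [np.outer(rng.random(2) + 0.1, rng.random(2) + 0.1) for _ in range(3)]
w = np.array([0.2, 0.5, 0.3])
w2, comps2 = reweight(w, comps, scheme="max")
assert np.allclose(sum(a * c for a, c in zip(w, comps)),
                   sum(a * c for a, c in zip(w2, comps2)))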
5. Experiments

We present experiments on grid-structured Ising models, random graphs with pairwise Ising potentials, and two real-world datasets from the UAI 2014 Inference Competition (Gogate, 2014). We test our algorithm against commonly used approximate inference methods, namely loopy belief propagation (labelled BP), mean-field (MF), tree-reweighted BP (TRW) and Gibbs sampling, using existing implementations from the libDAI package (Mooij, 2010). TBP was implemented in C++ inside the libDAI framework using Eigen (Guennebaud et al., 2010), and all tests were executed on a single core of a 1.4 GHz Intel Core i5 processor. In each experiment, we time the execution of TBP and allow Gibbs sampling the same wall-clock time. We run the other algorithms until convergence, which occurs quickly in each case (hence only the final performance is shown). Parameters used for BP, MF, TRW and Gibbs are given in the supplementary material. To build the junction tree, we use the min-fill heuristic² implemented in libDAI.

² Min-fill repeatedly eliminates variables that result in the lowest number of additional edges in the graph (Koller & Friedman, 2009).

5.1. Grid-structured Ising Models

Figure 1 gives results for 10×10 Ising models. The N×N Ising model is a planar grid-structured MRF described by the joint distribution

    p(x_1, \ldots, x_{N^2}) = \frac{1}{Z} \exp\!\left( \sum_{(i,j)} w_{ij} x_i x_j + \sum_i b_i x_i \right),

where (i, j) runs over all edges in the grid. Each variable X_i takes value either 1 or −1. In our experiments, we choose the w_{ij} uniformly from [−2, 2] (mixed interactions) or [0, 2] (attractive interactions), and the b_i uniformly from [−1, 1]. We use a symmetric rank-2 tensor decomposition for the pairwise potentials.
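The particular symmetric rank-2 decomposition used for the pairwise potentials is described in the supplementary material; as a stand-in for experimentation, any non-negative rank-2 decomposition of the 2×2 table exp(w_ij x_i x_j) will do. The sketch below simply splits the table into its two rows, which is not necessarily the authors' construction.

import numpy as np

def pairwise_ising_rank2(w):
    # One (illustrative, not necessarily the paper's) non-negative rank-2
    # decomposition of phi(x_i, x_j) = exp(w * x_i * x_j), x in {-1, +1}:
    # write the 2x2 table as 0.5 * (2 e_1) (x) row_1 + 0.5 * (2 e_2) (x) row_2.
    table = np.exp(w * np.outer([-1.0, 1.0], [-1.0, 1.0]))
    weights = np.array([0.5, 0.5])
    comps = [[np.array([2.0, 0.0]), table[0]],
             [np.array([0.0, 2.0]), table[1]]]
    return weights, comps

weights, comps = pairwise_ising_rank2(0.8)
dense = sum(wk * np.outer(a, b) for wk, (a, b) in zip(weights, comps))
assert np.allclose(dense, np.exp(0.8 * np.outer([-1.0, 1.0], [-1.0, 1.0])))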

[Figure 1 appears here: two panels, "Attractive interactions" and "Mixed interactions"; y-axis: marginal error; x-axis: sample size K (TBP) / matched running time (Gibbs); curves: TBP (NONE), TBP (MAX), TBP (VAR), BP, MF, TRW, Gibbs.]

Figure 1: Marginal error of TBP for various sample sizes K on 10 × 10 Ising models compared with other approximate inference methods (NONE = no reweighting, MAX = max-norm reweighting, VAR = min-variance reweighting).

We measure performance by the mean L1 error of marginal estimates over the N² variables,

    \frac{1}{N^2} \sum_{i=1}^{N^2} \left| P_{\mathrm{exact}}(X_i = 1) - P_{\mathrm{approx}}(X_i = 1) \right|.

Exact marginals were obtained using the exact junction tree algorithm (possible because the grid size is small). Results are all averages over 100 model instances generated randomly with the specified parameters; error bars show standard error. We give a description of the tensor decomposition and additional results for a range of grid sizes N and interaction strengths w_{ij} in the supplementary material.

As we increase the number of samples used for multiplication K, and hence running time, the performance of TBP improves as expected. On both attractive and mixed interaction models, loopy BP, mean-field and tree-reweighted BP perform poorly. Gibbs sampling performs poorly on models with attractive interactions but performs better on models with mixed interactions. This is expected since models with mixed interactions mix faster. Reweighting gives a noticeable improvement over not using reweighting; both max-norm reweighting and min-variance reweighting give very similar results (in Figure 1, they are not easily differentiated). For the remainder of our experiments, we show results for max-norm reweighting only.
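The error measure used in this subsection is simply the average absolute deviation of the estimated P(X_i = 1) values from the exact ones; a minimal sketch (the array layout is an assumption):

import numpy as np

def mean_l1_marginal_error(p_exact, p_approx):
    # p_exact, p_approx: length-N^2 arrays of P(X_i = 1), one entry per grid
    # variable, from the exact junction tree and from the approximate method.
    return float(np.mean(np.abs(np.asarray(p_exact) - np.asarray(p_approx))))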

5.2. Random Pairwise MRFs

We also test TBP on random binary N-node MRFs with pairwise Ising potentials. We construct the graphs by independently adding each edge (i, j) with probability 0.5, with the requirement that the resulting graph is connected. Pairwise potentials are of the same form as Ising models (i.e. exp[w_{ij} x_i x_j]), where interaction strengths w_{ij} are chosen uniformly from [0, 2] (attractive) or [−2, 2] (mixed) and the b_i are chosen uniformly from [−1, 1].

Figure 2a shows the performance of TBP with increasing sample size K on 15-node models. Similar to grid-structured Ising models, TBP performs well as the sample size is increased. Notably, TBP performs very well on attractive models where other methods perform poorly.

Figure 2b shows the performance of the algorithms as the graph size is increased to 30 nodes. Interestingly, TBP starts to fail for mixed interaction models as the graph size is increased, but remains very effective for attractive interaction models while other methods perform poorly.

5.3. UAI Competition Problems

Finally, we show results of TBP on two real-world datasets from the UAI 2014 Inference Competition, namely the Promedus and Linkage datasets. We chose these two problems following Zhu & Ermon (2015). To compute the initial tensor rank decompositions, we use the non-negative cp_nmu method from Bader et al. (2015), an iterative optimisation method based on (Lee & Seung, 2001). The initial potential functions are decomposed into mixtures with r components. We show results for r = 2 and r = 4. We measure performance by the mean L1 error over all states and all variables,

    \frac{1}{N} \sum_{i=1}^{N} \frac{1}{S_i} \sum_{x_i} \left| P_{\mathrm{exact}}(X_i = x_i) - P_{\mathrm{approx}}(X_i = x_i) \right|,

where S_i is the number of states of variable X_i. Results are averages over all problem instances in (Gogate, 2014).

[Figure 2 appears here: (a) effect of sample size K for N = 15 nodes; (b) effect of graph size N; panels for attractive and mixed interactions; y-axis: marginal error; curves: TBP (MAX) / TBP (K = 100000), BP, MF, TRW, Gibbs.]

Figure 2: Performance of TBP on random pairwise MRFs compared with other approximate inference algorithms.

[Figure 3 appears here: panels "Linkage" and "Promedus"; curves: Gibbs, TBP (r = 2), TBP (r = 4); y-axis: marginal error; x-axis: sample size K (TBP) / matched running time (Gibbs).]

Figure 3: Marginal error of TBP vs Gibbs sampling on UAI 2014 Inference Competition problems for various sample sizes K.

We see that TBP outperforms Gibbs sampling on both problems.³ On the Linkage dataset, using r = 2 mixture components for the initial decompositions performs best, and increasing r does not improve performance. On the Promedus dataset r = 4 performs better; in this case, increasing r may improve the accuracy of the decomposition of the initial potential functions. For reference, the marginal error achieved by the random projection method of Zhu & Ermon (2015) is 0.09 ± 0.02 for Linkage and 0.21 ± 0.06 for Promedus. TBP with K = 10⁵ achieved error of 0.08 ± 0.01 for Linkage and 0.12 ± 0.01 for Promedus, despite using substantially less running time.⁴

³ We omit results for BP, MF and TRW for these problems because the libDAI package was unable to run them.

⁴ TBP with K = 10⁵ takes an average across all problem instances of approximately 10 minutes (Linkage) and 3 minutes (Promedus) per problem on our machine, and does not differ substantially for r = 2 vs r = 4. The results described by Zhu & Ermon (2015) were comparable in running time to 10 million iterations of Gibbs sampling, which we estimate would take several hours on our machine without parallelism.

6. Conclusion and Future Work

We proposed a new sampling-based approximate inference algorithm, tensor belief propagation, which uses a mixture-of-rank-1-tensors representation to approximate cluster potentials and messages in the junction tree algorithm. It gives consistent estimators, and performs well on a range of problems against well-known algorithms.

In this paper, we have not addressed how to perform the initial tensor decomposition. There exist well-known optimisation algorithms for this problem that minimize Euclidean error (Kolda & Bader, 2009), though it is not clear that this is the best objective for the purpose of TBP. We are currently investigating the best way of formulating and solving this optimisation problem to yield accurate results during inference. Further, when constructing the junction tree, it is not clear whether min-fill and other well-known heuristics remain appropriate for TBP. In particular, these cost functions aim to minimise the induced width of the junction tree, which is no longer directly relevant to the performance of the algorithm. Further work may investigate other heuristics, for example with the goal of minimising cluster variance.

Finally, approximate inference algorithms have applications in learning (for example, in the gradient computations when training restricted Boltzmann machines), and thus TBP may improve learning as well as inference in some applications.

Acknowledgements

We thank the anonymous reviewers for their helpful comments. This work is supported by NUS AcRF Tier 1 grant R-252-000-639-114 and a QUT Vice-Chancellor's Research Fellowship Grant.

References

Andrieu, Christophe, De Freitas, Nando, Doucet, Arnaud, and Jordan, Michael I. An introduction to MCMC for machine learning. Machine Learning, 50(1-2):5–43, 2003.

Bader, Brett W., Kolda, Tamara G., et al. MATLAB Tensor Toolbox Version 2.6. http://www.sandia.gov/~tgkolda/TensorToolbox/, February 2015.

Chavira, Mark and Darwiche, Adnan. Compiling Bayesian networks with local structure. In IJCAI, 2005.

Darwiche, Adnan. A differential approach to inference in Bayesian networks. In Uncertainty in Artificial Intelligence (UAI), 2000.

Gogate, Vibhav. UAI 2014 Inference Competition. http://www.hlt.utdallas.edu/~vgogate/uai14-competition/index.html, 2014.

Gogate, Vibhav and Domingos, Pedro. Structured message passing. In Uncertainty in Artificial Intelligence (UAI), 2013.

Guennebaud, Gaël, Jacob, Benoît, et al. Eigen v3. http://eigen.tuxfamily.org, 2010.

Ihler, Alexander T and McAllester, David A. Particle belief propagation. In AISTATS, pp. 256–263, 2009.

Jensen, Finn Verner, Olesen, Kristian G, and Andersen, Stig Kjaer. An algebra of Bayesian belief universes for knowledge-based systems. Networks, 20(5):637–659, 1990.

Kolda, Tamara G and Bader, Brett W. Tensor decompositions and applications. SIAM Review, 51(3):455–500, 2009.

Koller, D. and Friedman, N. Probabilistic Graphical Models: Principles and Techniques. Adaptive Computation and Machine Learning. MIT Press, 2009.
Koller, Daphne, Lerner, Uri, and Angelov, Dragomir. A general algorithm for approximate inference and its application to hybrid Bayes nets. In Proceedings of the Fifteenth Conference on Uncertainty in Artificial Intelligence, pp. 324–333. Morgan Kaufmann Publishers Inc., 1999.

Lauritzen, Steffen L and Spiegelhalter, David J. Local computations with probabilities on graphical structures and their application to expert systems. Journal of the Royal Statistical Society, Series B (Methodological), pp. 157–224, 1988.

Lee, Daniel D and Seung, H Sebastian. Algorithms for non-negative matrix factorization. In Advances in Neural Information Processing Systems (NIPS), pp. 556–562, 2001.

McAuley, Julian J. and Caetano, Tibério S. Faster algorithms for max-product message-passing. Journal of Machine Learning Research, 12:1349–1388, 2011.

Minka, Thomas P. Expectation propagation for approximate Bayesian inference. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 362–369. Morgan Kaufmann Publishers Inc., 2001.

Mooij, Joris M. libDAI: A free and open source C++ library for discrete approximate inference in graphical models. Journal of Machine Learning Research, 11:2169–2173, August 2010.

Park, James D. and Darwiche, Adnan. A differential semantics for jointree algorithms. In Advances in Neural Information Processing Systems (NIPS), 2002.

Pearl, Judea. Reverend Bayes on inference engines: A distributed hierarchical approach. In AAAI, pp. 133–136, 1982.

Shafer, Glenn R and Shenoy, Prakash P. Probability propagation. Annals of Mathematics and Artificial Intelligence, 2(1-4):327–351, 1990.

Sudderth, Erik B, Ihler, Alexander T, Isard, Michael, Freeman, William T, and Willsky, Alan S. Nonparametric belief propagation. Communications of the ACM, 53(10):95–103, 2010.

Wainwright, Martin J and Jordan, Michael I. Graphical models, exponential families, and variational inference. Foundations and Trends in Machine Learning, 1(1-2):1–305, 2008.

Xue, Yexiang, Ermon, Stefano, Lebras, Ronan, Gomes, Carla P, and Selman, Bart. Variable elimination in Fourier domain. In Proc. 33rd International Conference on Machine Learning, 2016.

Zhu, Michael and Ermon, Stefano. A hybrid approach for probabilistic inference using random projections. In ICML, pp. 2039–2047, 2015.