
Neural Enhanced Belief Propagation on Factor Graphs

Victor Garcia Satorras, UvA-Bosch Delta Lab, University of Amsterdam
Max Welling, UvA-Bosch Delta Lab, University of Amsterdam

arXiv:2003.01998v5 [cs.LG] 16 Mar 2021

Proceedings of the 24th International Conference on Artificial Intelligence and Statistics (AISTATS) 2021, San Diego, California, USA. PMLR: Volume 130. Copyright 2021 by the author(s).

Abstract

A graphical model is a structured representation of locally dependent random variables. A traditional method to reason over these random variables is to perform inference using belief propagation. When provided with the true data generating process, belief propagation can infer the optimal posterior probability estimates in tree-structured factor graphs. However, in many cases we may only have access to a poor approximation of the data generating process, or we may face loops in the factor graph, leading to suboptimal estimates. In this work we first extend graph neural networks to factor graphs (FG-GNN). We then propose a new hybrid model that runs an FG-GNN jointly with belief propagation. The FG-GNN receives as input messages from belief propagation at every inference iteration and outputs a corrected version of them. As a result, we obtain a more accurate algorithm that combines the benefits of both belief propagation and graph neural networks. We apply our ideas to error correction decoding tasks, and we show that our algorithm can outperform belief propagation for LDPC codes on bursty channels.

1 Introduction

Graphical models (C. M. Bishop 2006; K. P. Murphy 2012) are a structured representation of locally dependent random variables that combine concepts from probability and graph theory. A standard way to reason over these random variables is to perform inference using message passing algorithms such as Belief Propagation (BP) (Pearl 2014; K. Murphy, Weiss, and Jordan 2013). Provided that the true generative process of the data is given by a non-loopy graphical model, BP is guaranteed to compute the optimal (posterior) marginal probability distributions. However, in real world scenarios, we may only have access to a poor approximation of the true distribution of the graphical model, leading to sub-optimal estimates. In addition, an important limitation of belief propagation is that on graphs with loops BP computes only an approximation to the desired posterior marginals, or may fail to converge at all.

In this paper we present a hybrid inference model to tackle these limitations. We cast our model as a message passing method on a factor graph that combines messages from BP and from a Graph Neural Network (GNN). The GNN messages are learned from data and complement the BP messages. The GNN receives as input the messages from BP at every inference iteration and delivers as output a refined version of them back to BP. As a result, given a labeled dataset, we obtain a more accurate algorithm that outperforms either Belief Propagation or Graph Neural Networks run in isolation, in cases where Belief Propagation is not guaranteed to obtain the optimal marginals.

Belief Propagation has demonstrated empirical success in a variety of applications: error correction decoding (McEliece, D. J. C. MacKay, and Cheng 1998), combinatorial optimization, in particular graph coloring and satisfiability (Braunstein and Zecchina 2004), inference in Markov logic networks (Richardson and Domingos 2006), and Kalman filtering, which is a special case of the BP algorithm (Yedidia, Freeman, and Weiss 2003; Welch, G. Bishop, et al. 1995). One of its most successful applications is Low Density Parity Check (LDPC) codes (Gallager 1962), an error correction decoding algorithm that runs BP on a loopy bipartite graph. LDPC is currently part of the Wi-Fi 802.11 standard, it is an optional part of 802.11n and 802.11ac, and it has been adopted for 5G, the
fifth generation wireless technology that began wide deployment in 2019. Although the LDPC graph is loopy, it is typically very sparse, which reduces the number of loops and increases the cycle size. As a result, in practice LDPC has shown excellent results in error correction decoding tasks and performs close to the Shannon limit in Gaussian channels.

However, a Gaussian channel is an approximation of the more complex noise distributions we encounter in the real world. Many of these distributions have no analytical form, but we can approximate them from data. In this work we show the robustness of our algorithm over LDPC codes when we assume such a non-analytical form. Our hybrid method is able to closely match the performance of LDPC in Gaussian channels while outperforming it for deviations from this assumption (e.g. a bursty noise channel (Gilbert 1960; Kim et al. 2018)).

The three main contributions of our work are: i) We extend the standard graph neural network equations to factor graphs (FG-GNN). ii) We present a new hybrid inference algorithm, Neural Enhanced Belief Propagation (NEBP), that refines BP messages using the FG-GNN. iii) We apply our method to an error correction decoding problem for a non-Gaussian (bursty) noise channel and show clear improvement on the Bit Error Rate over existing LDPC codes.

2 Background

2.1 Factor Graphs

Factor graphs (Loeliger 2004) are a convenient way of representing graphical models. A factor graph is a bipartite graph that interconnects a set of factors $f_s(x_s)$ with a set of variables $x_s$, each factor defining dependencies among its subset of variables. A global probability distribution $p(x)$ can be defined as the product of all these factors, $p(x) = \frac{1}{Z} \prod_{s \in \mathcal{F}} f_s(x_s)$, where $Z$ is the normalization constant of the probability distribution. A visual representation of a factor graph is depicted in the left image of Figure 1.

2.2 Belief Propagation

Belief Propagation (C. M. Bishop 2006), also known as the sum-product algorithm, is a message passing algorithm that performs inference on graphical models by locally marginalizing over random variables. It exploits the structure of factor graphs, allowing more efficient computation of the marginals. Belief Propagation operates directly on factor graphs by sending messages (real valued functions) along their edges. These messages exchange beliefs of the sender nodes about the receiver nodes, thereby transporting information about the variables' probabilities. We can distinguish two types of messages: those that go from variables to factors and those that go from factors to variables.

Variable to factor: $\mu_{x_m \to f_s}(x_m)$ is the product of all incoming messages to variable $x_m$ from all neighbor factors $\mathcal{N}(x_m)$ except for factor $f_s$.

$$\mu_{x_m \to f_s}(x_m) = \prod_{l \in \mathcal{N}(x_m) \setminus f_s} \mu_{f_l \to x_m}(x_m) \tag{1}$$

Factor to variable: $\mu_{f_s \to x_n}(x_n)$ is the product of the factor $f_s$ itself with all its incoming messages from all neighbor variable nodes except for $x_n$, marginalized over all associated variables $x_s$ except $x_n$.

$$\mu_{f_s \to x_n}(x_n) = \sum_{x_s \setminus x_n} f_s(x_s) \prod_{m \in \mathcal{N}(f_s) \setminus n} \mu_{x_m \to f_s}(x_m) \tag{2}$$

To run the Belief Propagation algorithm, messages are initialized with uniform probabilities, and the two operations above are then run recursively until convergence. One can subsequently obtain marginal estimates $p(x_n)$ by multiplying all incoming messages from the neighboring factors:

$$p(x_n) \propto \prod_{s \in \mathcal{N}(x_n)} \mu_{f_s \to x_n}(x_n) \tag{3}$$

From now on, we simplify notation by dropping the argument of the message functions. The left side of Figure 1 shows the defined messages on a factor graph where black squares represent factors and blue circles represent variables.

2.3 LDPC codes

In this paper we apply our proposed method to error correction decoding. Low Density Parity Check (LDPC) codes (Gallager 1962; D. J. MacKay 2003) are linear codes used to correct errors in data transmitted through noisy communication channels. The sender encodes the data with redundant bits while the receiver has to decode the original message. In an LDPC code, a sparse parity check matrix $H \in \mathbb{B}^{(n-k) \times n}$ is designed such that, given a code-word $x \in \mathbb{B}^n$ of $n$ bits, the product of $H$ and $x$ is constrained to equal zero: $Hx = 0$. $H$ can be interpreted as an adjacency matrix that connects $n$ variables (i.e. the transmitted bits) with $(n - k)$ factors, i.e. the parity checks that must sum to 0. An entry of $H$ is 1 if there is an edge between a factor and a variable, where rows index factors and columns index variables. For a linear code $(n, k)$, $n$ is the number of variables and $(n - k)$ the number of factors. The prior probability of the transmitted code-word, $P(x) \propto \mathbb{1}[Hx = 0 \bmod 2]$, can be factorized as:

$$P(x) \propto \prod_s \mathbb{1}\Big[\sum_{n \in \mathcal{N}(s)} x_n = 0 \bmod 2\Big] = \prod_s f_s(x_s) \tag{4}$$

At the receiver we get a noisy version of the code-word, $r$. The noise is assumed to be i.i.d., therefore we can express the likelihood of the received code-word $r$ given $x$ as $P(r|x) = \prod_n P(r_n|x_n)$. Finally, we can express the posterior distribution of the transmitted code-word given the received signal as:

$$P(x|r) \propto P(x) P(r|x) \tag{5}$$

Equation 5 is a product of factors, where the factors in $P(x)$ (eq. 4) are connected to multiple variables, expressing a constraint among them. The factors in $P(r|x)$ are each connected to a single variable, expressing a prior distribution for that variable. A visual representation of this factor graph is shown in the left image of Figure 1. Finally, in order to infer the transmitted code-word $x$ given $r$, we can just run the (loopy) Belief Propagation described in Section 2.2 on the factor graph described above (equation 5). In other words, error correction with LDPC codes can be interpreted as an instance of Belief Propagation applied to its associated factor graph.

3 Related Work

One of the closest works to our method is (Satorras, Akata, and Welling 2019), which also combines graphical inference with graph neural networks. However, in that work the model is only applied to the Kalman Filter, a hidden Gaussian Markov model for time sequences, and all factor graphs are assumed to be pair-wise. In our case, we run the GNN on arbitrary factor graphs, and we hybridize Belief Propagation, which allows us to enhance one of its main applications (LDPC codes). Other works also learn an inference model from data, like Recurrent Inference Machines (Putzky and Welling 2017) and Iterative Amortized Inference (Marino, Yue, and Mandt 2018). However, in our case we present a hybrid algorithm instead of a fully learned one. Additionally, in (Putzky and Welling 2017) graphical models play no role.

Our work is also related to meta learning (Schmidhuber 1987; Andrychowicz et al. 2016) in the sense that it learns a more flexible algorithm on top of an already existing one. It also has some interesting connections to the ideas from the consciousness prior (Bengio 2017), since our model is an actual implementation of a sparse factor graph that encodes prior knowledge about the task to solve.

Another interesting line of research concerns the convergence of graphical models with neural networks. In (Mirowski and LeCun 2009), the conditional probability distributions of a graphical model are replaced with trainable factors. (Johnson et al. 2016) learns a graphical latent representation and runs Belief Propagation on it. Combining the strengths of convolutional neural networks and conditional random fields has been shown to be effective in image segmentation tasks (Chen et al. 2014; Zheng et al. 2015). A model to run neural networks on factor graphs was also introduced in (Zhang, Wu, and Lee 2019). However, in our case, we simply adjust the Graph Neural Network equations to the factor graph scenario as a building block for our hybrid model (NEBP).

Closer to our work, (Yoon et al. 2018) trains a graph neural network to estimate the marginals in Binary Markov Random Fields (Ising model), and the performance is compared with Belief Propagation for loopy graphs. In our work we propose a hybrid method that combines the benefits of both GNNs and BP in a single model. In (Nachmani, Be'ery, and Burshtein 2016) some weights are learned on the edges of the Tanner graph for High Density Parity Check codes. In our case we use a GNN on the defined graphical model and we test our model on Low Density Parity Check codes, one of the standards in communications for error decoding. A subsequent work (Liu and Poulin 2019) uses the model from (Nachmani, Be'ery, and Burshtein 2016) for quantum error correcting codes. Recently, (Kuck et al. 2020) presented a strict generalization of Belief Propagation with neural networks. In contrast, our model augments Belief Propagation with a Graph Neural Network whose learned messages are not constrained to the message passing scheme of Belief Propagation, refining BP messages without the need to backpropagate through them.

4 Method

4.1 Graph Neural Network for Factor Graphs

We propose a hybrid method to improve Belief Propagation (BP) by combining it with Graph Neural Networks (GNNs). Both methods can be seen as message passing on a graph. However, where BP sends messages that follow directly from the definition of the graphical model, messages sent by GNNs must be learned from data. To achieve seamless integration of the two message passing algorithms, we first extend GNNs to factor graphs.

Graph Neural Networks (Bruna et al. 2013; Defferrard, Bresson, and Vandergheynst 2016; Kipf and Welling 2016) operate on graph-structured data by modelling interactions between pairs of nodes. A graph is defined as a tuple $G = (V, E)$, with nodes $v \in V$ and edges $e \in E$. Table 1 presents the edge and node operations that a GNN defines on a graph, using similar notation to (Gilmer et al. 2017).

GNN
v → e:   $m^t_{i \to j} = \phi_e(h^t_i, h^t_j, a_{ij})$
e → v:   $m^t_j = \sum_{i \in \mathcal{N}(j)} m^t_{i \to j}$
         $h^{t+1}_j = \phi_v([m^t_j, a_j], h^t_j)$

Table 1: Graph Neural Network equations.

The message passing procedure of a GNN is divided into two main steps: from node embeddings to edge embeddings (v → e), and from edges to nodes (e → v). Here $h^t_i$ is the embedding of a node $v_i$, $\phi_e$ is the edge operation, and $m^t_{i \to j}$ is the embedding of the edge $e_{ij}$. First, the edge embeddings $m_{i \to j}$ are computed, which one can interpret as messages. Next, we sum all incoming messages of node $v_j$. After that, the embedding representation $h_j$ of node $v_j$ is updated through the node function $\phi_v$. Values $a_{ij}$ and $a_j$ are optional edge and node attributes respectively.

In order to integrate the GNN messages with those of BP we have to run them on a factor graph. In (Yoon et al. 2018) a GNN was defined on pair-wise factor graphs (i.e. factor graphs where each factor contains only two variables). In their work each variable of the factor graph represents a node in the GNN, and each factor connecting two variables represents an edge in the GNN. Properties of the factors were associated with edge attributes $a_{ij}$. The mapping between GNNs and factor graphs becomes less trivial when each factor may contain more than two variables. We can then no longer consider each factor as an edge of the GNN. In this work we propose a special case of GNNs that runs on factor graphs with an arbitrary number of variables per factor.

Similarly to Belief Propagation, we first consider a factor graph as a bipartite graph $G_f = (V_f, E_f)$ with two types of nodes $V_f = \mathcal{X} \cup \mathcal{F}$, variable-nodes $v_x \in \mathcal{X}$ and factor-nodes $v_f \in \mathcal{F}$, and two types of edge interactions, depending on whether they go from factor-node to variable-node or vice versa. With this graph definition, all interactions are again pair-wise (between factor-nodes and variable-nodes in the bipartite graph).

A mapping between a factor graph and the graph we use in our GNN is shown in Figure 1. All factors from the factor graph are assumed to be factor-nodes in the GNN. We make an exception for factors connected to only one variable, which we simply consider as attributes of that variable-node in order to avoid redundant nodes. Once we have defined our graph, we take the GNN notation from Table 1 and re-write it specifically for this new graph. From now on we refer to these new equations as FG-GNN. The equations for the FG-GNN are presented in Table 2.

Figure 1: Visual representation of a LDPC Factor Graph (left) and its equivalent representation in our Graph Neural Network (right). In the Factor Graph, factors are displayed as black squares, variables as blue circles. In the Graph Neural Network, nodes associated to factors are displayed as black circles. Nodes associated to variables are displayed as blue circles.

FG-GNN
v → e:   $m^t_{x \to f} = \phi_{x \to f}(h^t_f, h^t_x, a_{x \to f})$
         $m^t_{f \to x} = \phi_{f \to x}(h^t_x, h^t_f, a_{f \to x})$
e → v:   $m^t_f = \sum_{x \in \mathcal{N}(f)} m^t_{x \to f}$
         $m^t_x = \sum_{f \in \mathcal{N}(x)} m^t_{f \to x}$
         $h^{t+1}_f = \phi_{v_f}([m^t_f, a_f], h^t_f)$
         $h^{t+1}_x = \phi_{v_x}([m^t_x, a_v], h^t_x)$

Table 2: Graph Neural Network for a Factor Graph
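The FG-GNN equations in Table 2 can be sketched in a few lines of plain Python. This is only an illustrative sketch under stated assumptions: `dense_tanh`, `init_weights` and `fg_gnn_step` are hypothetical names, each learned function is replaced by a single random dense layer with a tanh (the paper uses two-layer MLPs followed by a GRU), and the argument order of the edge functions follows Table 2 as printed.

```python
import math
import random

def dense_tanh(weights, vec):
    """Single dense layer with tanh, a simplified stand-in for the learned MLPs."""
    return [math.tanh(sum(w * v for w, v in zip(row, vec))) for row in weights]

def init_weights(rng, n_out, n_in):
    """Random weight matrix stored as a list of rows (illustrative only)."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_out)]

def fg_gnn_step(edges, h_x, h_f, a_xf, a_fx, a_v, a_f, params, dim):
    """One FG-GNN iteration; `edges` is a list of (variable, factor) pairs."""
    # v -> e: one latent message per direction on every edge.
    m_xf = {e: dense_tanh(params["x_to_f"], h_f[e[1]] + h_x[e[0]] + a_xf[e]) for e in edges}
    m_fx = {e: dense_tanh(params["f_to_x"], h_x[e[0]] + h_f[e[1]] + a_fx[e]) for e in edges}
    # e -> v: sum the incoming messages of every node.
    m_f = {f: [0.0] * dim for f in h_f}
    m_x = {x: [0.0] * dim for x in h_x}
    for x, f in edges:
        m_f[f] = [s + m for s, m in zip(m_f[f], m_xf[(x, f)])]
        m_x[x] = [s + m for s, m in zip(m_x[x], m_fx[(x, f)])]
    # Node updates use separate functions for factor-nodes and variable-nodes.
    new_h_f = {f: dense_tanh(params["v_f"], m_f[f] + a_f[f] + h_f[f]) for f in h_f}
    new_h_x = {x: dense_tanh(params["v_x"], m_x[x] + a_v[x] + h_x[x]) for x in h_x}
    return new_h_x, new_h_f, m_fx

# Toy graph: 3 variables, 2 factors, 4 edges.
rng = random.Random(0)
dim, edge_dim, var_dim = 4, 2, 2
edges = [(0, 0), (1, 0), (1, 1), (2, 1)]
h_x = {x: [rng.gauss(0, 1) for _ in range(dim)] for x in range(3)}
h_f = {f: [rng.gauss(0, 1) for _ in range(dim)] for f in range(2)}
a_xf = {e: [0.5, 0.5] for e in edges}   # edge attributes (BP messages in NEBP)
a_fx = {e: [0.5, 0.5] for e in edges}
a_v = {x: [0.5, 0.5] for x in h_x}      # variable attributes
a_f = {f: [] for f in h_f}              # no factor attributes in this toy case
params = {
    "x_to_f": init_weights(rng, dim, 2 * dim + edge_dim),
    "f_to_x": init_weights(rng, dim, 2 * dim + edge_dim),
    "v_f": init_weights(rng, dim, 2 * dim),
    "v_x": init_weights(rng, dim, 2 * dim + var_dim),
}
new_h_x, new_h_f, m_fx = fg_gnn_step(edges, h_x, h_f, a_xf, a_fx, a_v, a_f, params, dim)
```

The function returns the updated embeddings together with the factor-to-variable latent messages, since those are the quantities a downstream combination step would consume.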

Notice that in the GNN we did not have two different kinds of nodes in the graph and hence we only needed one edge function $\phi_e$ (although the order of the arguments of this function matters, so that a message from $i \to j$ is potentially different from the message in the reverse direction). For the FG-GNN, however, we now have two types of nodes, which necessitate two types of edge functions, $\phi_{x \to f}$ and $\phi_{f \to x}$, depending on whether the message was sent by a variable or a factor node. In addition, we also have two types of node embeddings $h_x$ and $h_f$ for the two types of nodes $v_x$ and $v_f$. Again we sum over all incoming messages for each node, but now in the node update we have two different functions, $\phi_{v_f}$ for the factor-nodes and $\phi_{v_x}$ for the variable-nodes. The optional edge attributes are now labeled $a_{x \to f}$ and $a_{f \to x}$, and the node attributes $a_f$ and $a_v$.

4.2 Neural Enhanced Belief Propagation

Now that we have defined the FG-GNN we can introduce our hybrid method, which runs jointly with Belief Propagation on a factor graph. We denote this new method Neural Enhanced Belief Propagation (NEBP). At a high level, the procedure is as follows: after every Belief Propagation iteration, we input the BP messages into the FG-GNN. The FG-GNN then runs for two iterations and updates the BP messages. This step is repeated recursively for $N$ iterations. After that, we can compute the marginals from the refined BP messages.

We first define the two functions BP(·) and FG-GNN(·). BP(·) takes as input the factor-to-node messages $\mu^t_{f \to x}$, runs the two BP updates of eqns. 1 and 2, and outputs the result of that computation as $\tilde{\mu}^t_{f \to x}, \tilde{\mu}^t_{x \to f}$. We initialize $\mu^0_{f \to x}$ as uniform distributions.

The function FG-GNN(·) runs the equations displayed in Table 2. At every iteration $t$ we give it as input the quantities $h^t = \{h^t_x | x \in \mathcal{X}\} \cup \{h^t_f | f \in \mathcal{F}\}$, $a_{x \to f}$, $a_{f \to x}$ and $a_v$. The initial embeddings $h^0$ are sampled at random. The attributes $a_{x \to f}$ and $a_{f \to x}$ are provided to FG-GNN(·) as the messages $\tilde{\mu}^t_{x \to f}$ and $\tilde{\mu}^t_{f \to x}$ obtained from BP(·). As an exception, the subset of messages from $\tilde{\mu}_{f \to x}$ that go from a singleton factor $f_l$ to its neighbor variable are treated as attributes $a_v$. The outputs of FG-GNN(·) are the updated latent vectors $h^{t+1}_x$ and the latent messages $m^t_{f \to x}$ computed as part of the FG-GNN algorithm in Table 2. These latent representations $M^t_{f \to x} = m^t_{f \to x} \cup h^{t+1}_x$ are used to update the current message estimates $\tilde{\mu}^t_{f \to x}$: specifically, $h^{t+1}_x$ refines those messages $\tilde{\mu}^t_{f_l \to x}$ that go from a singleton factor to its neighbor variable, and $m^t_{f \to x}$ refines the rest of the messages $\tilde{\mu}^t_{f \to x} \setminus \tilde{\mu}^t_{f_l \to x}$. All other variables computed inside FG-GNN(·) are kept internal to this function.

Finally, $f_s(\cdot)$ and $f_u(\cdot)$ take as input the embeddings $M^t_{f \to x}$ and output a refinement for the current message estimates $\tilde{\mu}^t_{f \to x}$. In particular, $f_s(\cdot)$ outputs a positive scalar value that multiplies the current estimate, and $f_u(\cdot)$ outputs a positive vector which is added to the estimate. Both functions encompass two Multi Layer Perceptrons (MLPs): one MLP takes as input the node embeddings $h^{t+1}_x$ and outputs the refinement for the singleton factor messages, and the second MLP takes as input the edge embeddings $m^t_{f \to x}$ and outputs a refinement for the rest of the messages $\tilde{\mu}^t_{f \to x} \setminus \tilde{\mu}^t_{f_l \to x}$. In summary, the hybrid algorithm looks as follows:

$$
\begin{aligned}
\tilde{\mu}^t_{f \to x}, \tilde{\mu}^t_{x \to f} &= \mathrm{BP}(\mu^t_{f \to x}) \\
M^t_{f \to x} &= \text{FG-GNN}(h^t, \tilde{\mu}^t_{f \to x}, \tilde{\mu}^t_{x \to f}) \\
\mu^{t+1}_{f \to x} &= \tilde{\mu}^t_{f \to x} f_s(M^t_{f \to x}) + f_u(M^t_{f \to x})
\end{aligned}
\tag{6}
$$

After running the algorithm for $N$ iterations, we obtain the estimate $\hat{p}(x_i)$ by using the same operation as in Belief Propagation (eq. 3), which amounts to taking a product of all incoming messages to node $x_i$, i.e. $\hat{p}(x_i) \propto \prod_{s \in \mathcal{N}(x_i)} \mu_{f_s \to x_i}$. From these marginal distributions we can compute any desired quantity on a node.

4.3 Training and Loss

The loss is computed from the estimated marginals $\hat{p}(x)$ and ground truth values $x_{gt}$, which we assume known during training. In the LDPC experiment the ground truths $x_{gt}$ are the transmitted bits, which are known by the receiver during the training stage.

$$\mathrm{Loss}(\Theta) = \mathcal{L}(x_{gt}, \hat{p}(x)) + \mathcal{R} \tag{7}$$

During training we back-propagate through the whole multi-layer estimation model (with each layer an iteration of the hybrid model), updating the weights $\Theta$ of the FG-GNN, $f_s(\cdot)$ and $f_u(\cdot)$. The number of training iterations is chosen by cross-validation. In our experiments we use the binary cross entropy loss for $\mathcal{L}$. The regularization term $\mathcal{R}$ is the mean of the $f_u(\cdot)$ outputs, i.e. $\mathcal{R} = \frac{1}{N} \sum_t \mathrm{mean}(f_u(M^t_{f \to x}))$. It encourages the model to behave closer to Belief Propagation. If the output of $f_u(\cdot)$ is set to 0, the hybrid algorithm is equivalent to Belief Propagation. This happens because $f_s(\cdot)$ outputs a scalar value that on its own only modifies the norm of $\tilde{\mu}^t_{f \to x}$. Belief Propagation can operate on unnormalized beliefs, although it is common practice to normalize them at every BP(·) iteration to avoid numerical instabilities. Therefore, modifying only the norm of the messages does not change our hybrid algorithm, because messages are normalized at every BP iteration.
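The loss of Equation 7 can be illustrated with a minimal sketch, assuming the estimated marginals are given as probabilities of each bit being 1 and the $f_u(\cdot)$ outputs of each iteration are collected in a list. The function names are hypothetical and the regularizer is added without a weighting factor, as in Equation 7.

```python
import math

def binary_cross_entropy(p_one, x_gt, eps=1e-12):
    """Mean binary cross entropy between estimated P(x_i = 1) and ground-truth bits."""
    total = 0.0
    for p, x in zip(p_one, x_gt):
        p = min(max(p, eps), 1.0 - eps)  # clip to keep the logarithms finite
        total -= x * math.log(p) + (1 - x) * math.log(1.0 - p)
    return total / len(x_gt)

def regularizer(f_u_outputs):
    """R = (1/N) * sum_t mean(f_u(M^t)); one flat list of f_u outputs per iteration."""
    per_iteration = [sum(out) / len(out) for out in f_u_outputs]
    return sum(per_iteration) / len(per_iteration)

def nebp_loss(p_one, x_gt, f_u_outputs):
    """Loss = L(x_gt, p_hat) + R, with L the binary cross entropy."""
    return binary_cross_entropy(p_one, x_gt) + regularizer(f_u_outputs)

# With f_u identically zero, only the cross entropy term remains.
loss_without_update = nebp_loss([0.5, 0.5], [1, 0], [[0.0, 0.0], [0.0, 0.0]])
```

Because $f_u(\cdot)$ outputs are positive, the regularizer is non-negative and shrinks to zero exactly when the additive correction vanishes, which is the regime in which the hybrid model reduces to Belief Propagation.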

Figure 2: Graphical illustration of our Neural Enhanced Belief Propagation algorithm. Three modules are depicted in each iteration {BP, FG-GNN, Comb.}. Each module is associated to each one of the three lines from Equation 6.
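The three modules of Figure 2 can be sketched on a toy factor graph over binary variables. This is a minimal sketch under stated assumptions, not the authors' implementation: `bp_step`, `nebp`, `marginals` and the `refine` callable are hypothetical names, and `refine` merely stands in for the FG-GNN followed by $f_s(\cdot)$ and $f_u(\cdot)$. The example uses a refinement with zero additive update and an arbitrary positive scale, so the result should coincide with plain Belief Propagation.

```python
import itertools

def normalize(msg):
    total = sum(msg)
    return [m / total for m in msg]

def bp_step(factors, mu_fx):
    """One flooding BP sweep over binary variables (Equations 1 and 2).

    factors: {factor_name: (scope, table)} with table(assignment) -> float
    mu_fx:   {(factor_name, variable): [m(0), m(1)]}
    """
    mu_xf = {}
    for f, (scope, _) in factors.items():
        for x in scope:
            msg = [1.0, 1.0]
            for (g, y), incoming in mu_fx.items():
                if y == x and g != f:  # every neighbour factor except f
                    msg = [a * b for a, b in zip(msg, incoming)]
            mu_xf[(x, f)] = normalize(msg)
    new_fx = {}
    for f, (scope, table) in factors.items():
        for i, x in enumerate(scope):
            msg = [0.0, 0.0]
            for assign in itertools.product((0, 1), repeat=len(scope)):
                weight = table(assign)
                for j, y in enumerate(scope):
                    if j != i:  # every neighbour variable except x
                        weight *= mu_xf[(y, f)][assign[j]]
                msg[assign[i]] += weight
            new_fx[(f, x)] = normalize(msg)
    return new_fx, mu_xf

def nebp(factors, refine, n_iters):
    """Hybrid loop following the three lines of Equation 6."""
    mu = {(f, x): [0.5, 0.5] for f, (scope, _) in factors.items() for x in scope}
    for _ in range(n_iters):
        mu_tilde, mu_xf = bp_step(factors, mu)       # BP module
        scale, update = refine(mu_tilde, mu_xf)      # FG-GNN module (stand-in)
        mu = {k: [scale[k] * m + u for m, u in zip(v, update[k])]
              for k, v in mu_tilde.items()}          # combination module
    return mu

def marginals(mu_fx):
    """Equation 3: normalized product of incoming factor-to-variable messages."""
    beliefs = {}
    for (f, x), msg in mu_fx.items():
        prev = beliefs.get(x, [1.0, 1.0])
        beliefs[x] = [a * b for a, b in zip(prev, msg)]
    return {x: normalize(b) for x, b in beliefs.items()}

# Tree-structured toy model: a prior on `a` and a soft equality between `a` and `b`.
factors = {
    "prior": (("a",), lambda s: (0.8, 0.2)[s[0]]),
    "equal": (("a", "b"), lambda s: 0.9 if s[0] == s[1] else 0.1),
}

def no_update(mu_tilde, mu_xf):
    # Zero additive update and an arbitrary positive scale: messages are only rescaled.
    return {k: 3.7 for k in mu_tilde}, {k: [0.0, 0.0] for k in mu_tilde}

estimated = marginals(nebp(factors, no_update, n_iters=3))
```

On this tree the estimate for `b` matches the exact marginal $0.8 \cdot 0.9 + 0.2 \cdot 0.1 = 0.74$, since rescaling a message has no effect once messages are normalized inside the BP step.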

5 Experiments

We analyze the performance of Belief Propagation, FG-GNNs, and our Neural Enhanced Belief Propagation (NEBP) in an error correction task where Belief Propagation is also known as LDPC (Section 2.3), and in an inference task on the Ising model (Section 5.2). In both FG-GNN and NEBP, the edge operations $\phi_{x \to f}$, $\phi_{f \to x}$ defined in Section 4 consist of two-layer Multilayer Perceptrons (MLPs). The node update functions $\phi_{v_f}$ and $\phi_{v_x}$ consist of two-layer MLPs followed by a Gated Recurrent Unit (Chung et al. 2014). Functions $f_s(\cdot)$ and $f_u(\cdot)$ from the NEBP combination module also contain two-layer MLPs.

5.1 Low Density Parity Check codes

LDPC codes, explained in Section 2.3, are a particular case of Belief Propagation run on a bipartite graph for error correction decoding tasks. These bipartite graphs generally contain cycles, hence Belief Propagation is no longer guaranteed to converge nor to provide the optimal estimates. Despite this lack of guarantees, LDPC has shown excellent results near the Shannon limit (D. J. MacKay and Neal 1996) for Gaussian channels. LDPC assumes a channel with an analytical solution, commonly a Gaussian channel. In real world scenarios, the channel may differ from Gaussian, or it may not even have a clean analytical solution to run Belief Propagation on, leading to sub-optimal estimates. An advantage of neural networks is that, in such cases, they can learn a decoding algorithm from data.

In this experiment we consider the bursty noisy channel from (Kim et al. 2018), where a signal $x_i$ is transmitted through a standard Gaussian channel with noise $z_i \sim \mathcal{N}(0, \sigma_c^2)$, but this time a larger noise signal $w_i \sim \mathcal{N}(0, \sigma_b^2)$ is added with a small probability $\rho$. More formally:

$$r_i = x_i + z_i + p_i w_i \tag{8}$$

where $r_i$ is the received signal, and $p_i$ follows a Bernoulli distribution such that $p_i = 1$ with probability $\rho$ and $p_i = 0$ with probability $1 - \rho$. In our experiments, we set $\rho = 0.05$ as done in (Kim et al. 2018). This bursty channel describes how unexpected signals may cause interference in the middle of a transmitted frame. For example, radars may cause bursty interference in wireless communications. In LDPC, the SNR is assumed to be known and fixed for a given frame, yet in practice it needs to be estimated with a known preamble (the pilot sequence) transmitted before the frame. If bursty noise occurs in the middle of the transmission, the estimated SNR is blind to this new noise level.

Dataset: We use the parity check matrix $H$ "96.3.963" from (D. MacKay and Codes 2009) for all experiments, with $n = 96$ variables and $n - k = 48$ factors, i.e. a transmitted code-word $x \in \mathbb{B}^n$ contains 96 bits. The training dataset consists of pairs of received and transmitted code-words $\{(r_d, x_d)\}_{1 \le d \le L}$. The transmitted code-words $x$ are used as ground truth for training the decoding algorithm. The received code-words $r$ are obtained by transmitting $x$ through the bursty channel from Equation 8. We generate samples for $\mathrm{SNR_{dB}} \in \{0, 1, 2, 3, 4\}$. For the bursty noise, we randomly sample its standard deviation from a uniform distribution, $\sigma_b \in [0, 5]$. We generate a validation partition of 750 code-words (150 code-words per $\mathrm{SNR_{dB}}$ value). For the training partition we keep generating samples until convergence, i.e. until we do not see further improvement in the validation accuracy.

Training procedure: We provide as input to the model the received code-word $r_d$ and the SNR for that code-word. These values are provided as node
Figure 3: Bit Error Rate (BER) with respect to the Signal to Noise Ratio (SNR) for different bursty noise values σb ∈ {0, 1, 2, 3, 4, 5}. attributes av described in Section 4. We run the algo- rithms for 20 iterations and the loss is computed as the cross entropy between the estimated xˆ and the ground truth xd. We use an Adam optimizer (Kingma and Ba 2014) with a learning rate 2e−4 and batch size of 1. The number of hidden features is 32 and all acti- vation functions are ’Selus’ (Klambauer et al. 2017). As a evaluation metric we compute the Bit Error Rate (BER), which is the number of error bits divided by the total amount of transmitted bits. The number of test code-words we used to evaluate each point from our plots (Figure 3) is at least 200 , where n is the BERˆ ·n number of bits per code-word and BERˆ is the esti- mated Bit Error Rate for that point. Figure 4: Bit Error Rate (BER) with respect to σb value for a fixed SNR=3. Baselines: Beside the already mentioned methods (FG-GNN and standard LDPC error correction de- coding), we also run two extra baselines. The first one we call Bits baseline, which consists of indepen- sweep the SNR from 0 to 4. Notice that for σb = 0 the bursty noise is non-existent and the channel is equiv- dently estimating each bit that maximizes p(ri|xi). The other baseline, called LDPC-bursty, is a variation alent to an Additive White Gaussian Noise channel of LDPC, where instead of considering a SNR with a (AWGN). LDPC has analytically been designed for 2 this channel obtaining its best performance here. The noise level σc = var[z], we consider the noise distribu- tion from Equation 8 such that now the noise variance aim of our algorithm is to outperform LDPC for σb > 0 2 2 2 2 while still matching its performance for σb = 0. As is σ = var[z +pw] = σ +(ρ(1−ρ)+ρ )E 2 [σ ]. 
This c σb b is a fairer comparison to our learned methods, because shown in the plots, as σb increases, the performance of NEBP and FG-GNN improves compared to the other even if we are blind to the σb value, we know there may be a bursty noise with probability ρ and σ ∼ U(0, 5). methods, with NEBP always achieving the best per- b formance, and getting close to the LDPC performance Results: In Figure 3 we show six different plots for for the AWGN channel (σb = 0). In summary, the each of the σb values {0, 1, 2, 3, 4, 5}. In each plot we hybrid method is more robust than LDPC, obtain- Neural Enhanced Belief Propagation on Factor Graphs

Model True Graphical Model (u = 0) Mismatch (u = 0.2) Mismatch (u = 0.4) Mismatch (u = 0.8) FG-GNN 0.0141 0.0570 0.1170 0.1659 BP 0.0190 0.0711 0.2081 0.3121 BP (damping) 0.0055 0.0519 0.1318 0.1961 NEBP 0.0091 0.0509 0.1057 0.1697

Table 3: KL divergence between the true marginals p(x) and the estimated marginals for the Ising model dataset.

ing competitive results to LDPC for AWGN channels but still outperforming it when bursty interferences are present. The FG-GNN, in contrast, obtains relatively poor performance compared to LDPC for small $\sigma_b$, demonstrating that belief propagation is still a very powerful tool compared to pure learned inference for this task. Our NEBP is able to combine the benefits from both LDPC and the FG-GNN to achieve the best performance, exploiting the adaptability of FG-GNN and the prior knowledge of Belief Propagation. Finally, LDPC-bursty shows a more robust performance as we increase $\sigma_b$, but it is significantly outperformed by NEBP in bursty channels, and it also performs slightly worse than LDPC for the AWGN channel ($\sigma_b = 0$).

In order to better visualize the decrease in performance as the burst variance increases, we sweep over different $\sigma_b$ values for a fixed SNR=3. The result is shown in Figure 4. The larger $\sigma_b$, the larger the BER. However, the performance decreases much less for our NEBP method than for LDPC and LDPC-bursty. In other words, NEBP is more robust as we move away from the AWGN assumption. We want to emphasize that in real world scenarios, the channel may always deviate from Gaussian. Even when assuming an AWGN channel, its parameters (SNR) must be estimated in real scenarios. These potential deviations make hybrid methods a very promising approach.

5.2 Ising model

In this section we evaluate our algorithm in a Binary Markov Random Field, specifically the Ising model. We consider a square lattice type Ising model defined by the following energy function $p(x) = \frac{1}{Z} \exp(b \cdot x + x \cdot J \cdot x)$, with variables $x \in \{+1, -1\}^{|V|}$, where $b$ biases individual variables and $J$ couples pairs of neighbor variables, $b_i \sim \mathcal{N}(0, 0.25^2)$, $J_{ij} \sim \mathcal{N}(0, 1)$, and $|V|$ is the number of variables, which is set to 16. This energy function can be equally expressed as a product of singleton factors $f_i(x_i) = e^{x_i b_i}$ and pairwise factors $f_{ij}(x_i, x_j) = e^{J_{ij} x_i x_j}$. The task is to obtain the marginal probabilities $p(x_i)$ given the factor graph $p(x)$. As in the LDPC experiment, we assume some hidden dynamics not defined in the graphical model that make the inference task harder. We do so by adding a soft interaction among all pairs of variables sampled from a uniform probability distribution $u_{ij} \sim U(0, u)$.

The Ising model is a loopy graphical model where the performance of Belief Propagation may significantly degrade due to cyclic information. Therefore, in this experiment we include a stronger baseline where Belief Propagation messages are damped to reduce the effect of cyclic information (Koller and Friedman 2009). This modification significantly increased its performance. In contrast, in the previous experiment (LDPC), damping did not lead to improvements. This is consistent with the fact that LDPC graphs are very sparse, which minimizes cyclic information, and Belief Propagation performs optimally for non-cyclic graphs.

Implementation details: Belief Propagation takes the previously defined factors $f_i$ and $f_{ij}$ as the input factors $f_s$ defined in Section 2.2. In the FG-GNN baseline, the values $b_i$ and $J_{ij}$ are input as the variable and factor attributes $a_v$ and $a_f$ from Table 2, respectively. Finally, our NEBP combines FG-GNN and BP messages as explained in Section 4.2. The damping parameter was chosen by cross-validating on the validation partition. All algorithms are run for 10 iterations. Results have been averaged over three runs. All methods have been trained for 400 epochs with the Adam optimizer, batch size 1 and learning rate 1e−5. All hidden layers have 64 neurons and use the ’Leaky Relu’ (Xu et al. 2015) as activation function.

Results: In Table 3 we present the KL divergence between the estimated marginals and the ground truth p(x) for each model trained for different $u$ values. Since we are working with cyclic data, BP is not an exact inference algorithm even when provided with the true graphical model ($u = 0$). Damping decreases the cyclic information such that for this setting ($u = 0$) Damping Belief Propagation gives the best performance. NEBP gets a significant margin w.r.t. BP and FG-GNN for $u = 0$, and it outperforms other methods when the mismatch $u$ increases. As $u$ gets larger, we reach a point ($u = 0.8$) where BP messages are no longer beneficial, such that the learned messages (FG-GNN) alone perform best, while we still get a close performance with NEBP.
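The Ising setup of Section 5.2 and the damped Belief Propagation baseline can be illustrated with a minimal, standard-library Python sketch. It is not the authors' implementation. All function names (`make_ising`, `true_couplings`, `exact_marginals`, `damped_bp`, `mean_kl`) are hypothetical. Several choices are assumptions not stated in the text: the hidden interactions $u_{ij}$ are added to the known couplings, each pair is counted once as in the factor form $f_{ij}$, damping is applied to log-ratio messages, and the KL divergence is averaged over variables in the direction KL(truth || estimate).

```python
import itertools
import math
import random


def make_ising(side=4, u=0.0, seed=0):
    """Sample biases, lattice couplings and hidden all-pairs interactions."""
    rng = random.Random(seed)
    n = side * side
    b = [rng.gauss(0.0, 0.25) for _ in range(n)]  # b_i ~ N(0, 0.25^2)
    J = {}  # J_ij ~ N(0, 1) on the edges of the square lattice
    for r in range(side):
        for c in range(side):
            i = r * side + c
            if c + 1 < side:
                J[(i, i + 1)] = rng.gauss(0.0, 1.0)
            if r + 1 < side:
                J[(i, i + side)] = rng.gauss(0.0, 1.0)
    # Hidden interactions u_ij ~ U(0, u), unknown to the graphical model.
    hidden = {(i, j): rng.uniform(0.0, u)
              for i in range(n) for j in range(i + 1, n)}
    return b, J, hidden


def true_couplings(J, hidden):
    """ASSUMPTION: hidden interactions add to the known couplings."""
    return {pair: J.get(pair, 0.0) + w for pair, w in hidden.items()}


def exact_marginals(b, couplings):
    """Brute-force p(x_i = +1) by enumerating all 2^n states."""
    n = len(b)
    plus = [0.0] * n
    Z = 0.0
    for x in itertools.product((1, -1), repeat=n):
        e = sum(bi * xi for bi, xi in zip(b, x))
        e += sum(w * x[i] * x[j] for (i, j), w in couplings.items())
        p = math.exp(e)
        Z += p
        for i in range(n):
            if x[i] == 1:
                plus[i] += p
    return [w / Z for w in plus]


def damped_bp(b, J, iters=10, damping=0.5):
    """Loopy BP with log-ratio messages; damping=0 gives plain BP."""
    n = len(b)
    nbrs = {i: [] for i in range(n)}
    h = {}  # h[(i, j)]: message from variable i to variable j
    for (i, j) in J:
        nbrs[i].append(j)
        nbrs[j].append(i)
        h[(i, j)] = h[(j, i)] = 0.0
    for _ in range(iters):
        new = {}
        for (i, j) in h:
            w = J[(i, j)] if (i, j) in J else J[(j, i)]
            cavity = b[i] + sum(h[(k, i)] for k in nbrs[i] if k != j)
            update = math.atanh(math.tanh(w) * math.tanh(cavity))
            # ASSUMPTION: damping is a convex mix in log-ratio space.
            new[(i, j)] = damping * h[(i, j)] + (1.0 - damping) * update
        h = new
    return [0.5 * (1.0 + math.tanh(b[i] + sum(h[(k, i)] for k in nbrs[i])))
            for i in range(n)]


def mean_kl(p, q, eps=1e-12):
    """Mean Bernoulli KL(p || q) over variables (assumed direction)."""
    total = 0.0
    for pi, qi in zip(p, q):
        for a, c in ((pi, qi), (1.0 - pi, 1.0 - qi)):
            total += a * math.log((a + eps) / (c + eps))
    return total / len(p)


if __name__ == "__main__":
    b, J, hidden = make_ising(side=4, u=0.4)
    truth = exact_marginals(b, true_couplings(J, hidden))
    for alpha in (0.0, 0.5):
        est = damped_bp(b, J, iters=10, damping=alpha)
        print(f"damping={alpha}: mean KL = {mean_kl(truth, est):.4f}")
```

In this sketch BP only sees the lattice couplings `J`, while the ground truth is computed from the couplings that include the hidden interactions, which mirrors the model mismatch described above. Enumeration is feasible only because the lattice has 16 variables.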

6 Conclusions

In this work, we presented a hybrid inference method that enhances Belief Propagation by jointly running a Graph Neural Network that we designed for factor graphs. In cases where the data generating process is not fully known (e.g. the parameters of the graphical model need to be estimated from data), belief propagation does not perform optimally. Our hybrid model, in contrast, is able to combine the prior knowledge encoded in the graphical model (albeit with the wrong parameters) with a (factor) graph neural network whose parameters are learned from labeled data on a representative distribution of channels. Note that we can think of this as meta-learning because the FG-GNN is not trained on one specific channel but on a distribution of channels, and therefore must perform well on any channel sampled from this distribution without knowing its specific parameters. We tested our ideas on a state-of-the-art LDPC implementation with realistic bursty noise distributions. Our experiments clearly show that the neural enhancement of LDPC improves performance both relative to LDPC and relative to FG-GNN as the variance in the bursts gets larger.

References

Andrychowicz, Marcin et al. (2016). “Learning to learn by gradient descent by gradient descent”. In: Advances in neural information processing systems, pp. 3981–3989.
Bengio, Yoshua (2017). “The consciousness prior”. In: arXiv preprint arXiv:1709.08568.
Bishop, Christopher M (2006). Pattern recognition and machine learning. Springer.
Braunstein, Alfredo and Riccardo Zecchina (2004). “Survey propagation as local equilibrium equations”. In: Journal of Statistical Mechanics: Theory and Experiment 2004.06, P06007.
Bruna, Joan et al. (2013). “Spectral networks and locally connected networks on graphs”. In: arXiv preprint arXiv:1312.6203.
Chen, Liang-Chieh et al. (2014). “Semantic image segmentation with deep convolutional nets and fully connected crfs”. In: arXiv preprint arXiv:1412.7062.
Chung, Junyoung et al. (2014). “Empirical evaluation of gated recurrent neural networks on sequence modeling”. In: arXiv preprint arXiv:1412.3555.
Defferrard, Michaël, Xavier Bresson, and Pierre Vandergheynst (2016). “Convolutional neural networks on graphs with fast localized spectral filtering”. In: Advances in neural information processing systems, pp. 3844–3852.
Gallager, Robert (1962). “Low-density parity-check codes”. In: IRE Transactions on information theory 8.1, pp. 21–28.
Gilbert, Edgar N (1960). “Capacity of a burst-noise channel”. In: Bell system technical journal 39.5, pp. 1253–1265.
Gilmer, Justin et al. (2017). “Neural message passing for quantum chemistry”. In: Proceedings of the 34th International Conference on Machine Learning-Volume 70. JMLR.org, pp. 1263–1272.
Johnson, Matthew J et al. (2016). “Composing graphical models with neural networks for structured representations and fast inference”. In: Advances in neural information processing systems, pp. 2946–2954.
Kim, Hyeji et al. (2018). “Communication algorithms via deep learning”. In: arXiv preprint arXiv:1805.09317.
Kingma, Diederik P and Jimmy Ba (2014). “Adam: A method for stochastic optimization”. In: arXiv preprint arXiv:1412.6980.
Kipf, Thomas N and Max Welling (2016). “Semi-supervised classification with graph convolutional networks”. In: arXiv preprint arXiv:1609.02907.
Klambauer, Günter et al. (2017). “Self-normalizing neural networks”. In: Advances in neural information processing systems, pp. 971–980.
Koller, Daphne and Nir Friedman (2009). Probabilistic graphical models: principles and techniques. MIT press.
Kuck, Jonathan et al. (2020). “Belief Propagation Neural Networks”. In: arXiv preprint arXiv:2007.00295.
Liu, Ye-Hua and David Poulin (2019). “Neural belief-propagation decoders for quantum error-correcting codes”. In: Physical review letters 122.20, p. 200501.
Loeliger, H-A (2004). “An introduction to factor graphs”. In: IEEE Signal Processing Magazine 21.1, pp. 28–41.
MacKay, David and Error-Correcting Codes (2009). “David MacKay’s Gallager code resources”. URL: http://www.inference.phy.cam.ac.uk/mackay/CodesFiles.html.
MacKay, David JC (2003). Information theory, inference and learning algorithms. Cambridge university press.
MacKay, David JC and Radford M Neal (1996). “Near Shannon limit performance of low density parity check codes”. In: Electronics letters 32.18, pp. 1645–1646.
Marino, Joseph, Yisong Yue, and Stephan Mandt (2018). “Iterative amortized inference”. In: arXiv preprint arXiv:1807.09356.

McEliece, Robert J., David J. C. MacKay, and Jung-Fu Cheng (1998). “Turbo decoding as an instance of Pearl’s ‘belief propagation’ algorithm”. In: IEEE Journal on selected areas in communications 16.2, pp. 140–152.
Mirowski, Piotr and Yann LeCun (2009). “Dynamic factor graphs for time series modeling”. In: Joint European Conference on Machine Learning and Knowledge Discovery in Databases. Springer, pp. 128–143.
Murphy, Kevin, Yair Weiss, and Michael I Jordan (2013). “Loopy belief propagation for approximate inference: An empirical study”. In: arXiv preprint arXiv:1301.6725.
Murphy, Kevin P (2012). Machine learning: a probabilistic perspective. MIT press.
Nachmani, Eliya, Yair Be’ery, and David Burshtein (2016). “Learning to decode linear codes using deep learning”. In: 2016 54th Annual Allerton Conference on Communication, Control, and Computing (Allerton). IEEE, pp. 341–346.
Pearl, Judea (2014). Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier.
Putzky, Patrick and Max Welling (2017). “Recurrent inference machines for solving inverse problems”. In: arXiv preprint arXiv:1706.04008.
Richardson, Matthew and Pedro Domingos (2006). “Markov logic networks”. In: Machine learning 62.1-2, pp. 107–136.
Satorras, Victor Garcia, Zeynep Akata, and Max Welling (2019). “Combining Generative and Discriminative Models for Hybrid Inference”. In: arXiv preprint arXiv:1906.02547.
Schmidhuber, Jürgen (1987). “Evolutionary principles in self-referential learning, or on learning how to learn: the meta-meta-... hook”. PhD thesis. Technische Universität München.
Welch, Greg, Gary Bishop, et al. (1995). “An introduction to the Kalman filter”.
Xu, Bing et al. (2015). “Empirical evaluation of rectified activations in convolutional network”. In: arXiv preprint arXiv:1505.00853.
Yedidia, Jonathan S, William T Freeman, and Yair Weiss (2003). “Understanding belief propagation and its generalizations”. In: Exploring artificial intelligence in the new millennium 8, pp. 236–239.
Yoon, KiJung et al. (2018). “Inference in probabilistic graphical models by graph neural networks”. In: arXiv preprint arXiv:1803.07710.
Zhang, Zhen, Fan Wu, and Wee Sun Lee (2019). “Factor Graph Neural Network”. In: arXiv preprint arXiv:1906.00554.
Zheng, Shuai et al. (2015). “Conditional random fields as recurrent neural networks”. In: Proceedings of the IEEE international conference on computer vision, pp. 1529–1537.