The Generalized Distributive Law
Srinivas M. Aji and Robert J. McEliece, Fellow, IEEE
Abstract—In this semitutorial paper we discuss a general message passing algorithm, which we call the generalized distributive law (GDL). The GDL is a synthesis of the work of many authors in the information theory, digital communications, signal processing, statistics, and artificial intelligence communities. It includes as special cases the Baum–Welch algorithm, the fast Fourier transform (FFT) on any finite Abelian group, the Gallager–Tanner–Wiberg decoding algorithm, Viterbi's algorithm, the BCJR algorithm, Pearl's "belief propagation" algorithm, the Shafer–Shenoy probability propagation algorithm, and the turbo decoding algorithm. Although this algorithm is guaranteed to give exact answers only in certain cases (the "junction tree" condition), unfortunately not including the cases of GTW with cycles or turbo decoding, there is much experimental evidence, and a few theorems, suggesting that it often works approximately even when it is not supposed to.

Index Terms—Belief propagation, distributive law, graphical models, junction trees, turbo codes.

Manuscript received July 8, 1998; revised September 23, 1999. This work was supported by NSF under Grant NCR-9505975, AFOSR under Grant 5F49620-97-1-0313, and a grant from Qualcomm. A portion of McEliece's contribution was performed at the Sony Corporation in Tokyo, Japan, while he was a holder of a Sony Sabbatical Chair. Preliminary versions of this paper were presented at the IEEE International Symposium on Information Theory, Ulm, Germany, June 1997, and at ISCTA 1997, Ambleside, U.K., July 1997. The authors are with the Department of Electrical Engineering, California Institute of Technology, Pasadena, CA 91125 USA (e-mail: {mas; rjm}@systems.caltech.edu). Communicated by F. R. Kschischang, Associate Editor for Coding Theory.

I. INTRODUCTION

The humble distributive law, in its simplest form, states that $ab + ac = a(b + c)$. The left side of this equation involves three arithmetic operations (one addition and two multiplications), whereas the right side needs only two. Thus the distributive law gives us a "fast algorithm" for computing $ab + ac$. The object of this paper is to demonstrate that the distributive law can be vastly generalized, and that this generalization leads to a large family of fast algorithms, including Viterbi's algorithm and the fast Fourier transform (FFT). To give a better idea of the potential power of the distributive law and to introduce the viewpoint we shall take in this paper, we offer the following example.

Example 1.1: Let $f(x_1, x_2, x_4)$ and $g(x_1, x_3, x_4)$ be given real-valued functions, where $x_1$, $x_2$, $x_3$, and $x_4$ are variables taking values in a finite set $A$ with $q$ elements. Suppose we are given the task of computing tables of the values of $\alpha(x_1)$ and $\beta(x_4)$, defined as follows:

    $\alpha(x_1) \triangleq \sum_{x_2, x_3, x_4 \in A} f(x_1, x_2, x_4)\, g(x_1, x_3, x_4)$    (1.1)

    $\beta(x_4) \triangleq \sum_{x_1, x_2, x_3 \in A} f(x_1, x_2, x_4)\, g(x_1, x_3, x_4)$.    (1.2)

(We summarize (1.1) by saying that $\alpha(x_1)$ is obtained by "marginalizing out" the variables $x_2$, $x_3$, and $x_4$ from the function $f(x_1, x_2, x_4) g(x_1, x_3, x_4)$. Similarly, $\beta(x_4)$ is obtained by marginalizing out $x_1$, $x_2$, and $x_3$ from the same function.) How many arithmetic operations (additions and multiplications) are required for this task? If we proceed in the obvious way, we notice that for each of the $q$ values of $x_1$ there are $q^3$ terms in the sum defining $\alpha(x_1)$, each term requiring one addition and one multiplication, so that the total number of arithmetic operations required for the computation of $\alpha(x_1)$ is $2q^4$. Similarly, computing $\beta(x_4)$ requires $2q^4$ operations, so computing both $\alpha(x_1)$ and $\beta(x_4)$ using the direct method requires $4q^4$ operations.

On the other hand, because of the distributive law, the sum in (1.1) factors:

    $\alpha(x_1) = \sum_{x_4 \in A} \Bigl( \sum_{x_2 \in A} f(x_1, x_2, x_4) \Bigr) \Bigl( \sum_{x_3 \in A} g(x_1, x_3, x_4) \Bigr)$.    (1.3)

Using this fact, we can simplify the computation of $\alpha(x_1)$. First we compute tables of the functions $f_1(x_1, x_4)$ and $g_1(x_1, x_4)$ defined by

    $f_1(x_1, x_4) \triangleq \sum_{x_2 \in A} f(x_1, x_2, x_4), \qquad g_1(x_1, x_4) \triangleq \sum_{x_3 \in A} g(x_1, x_3, x_4)$    (1.4)

which requires a total of $2q^2(q - 1)$ additions. Then we compute the values of $\alpha(x_1)$ using the formula (cf. (1.3))

    $\alpha(x_1) = \sum_{x_4 \in A} f_1(x_1, x_4)\, g_1(x_1, x_4)$    (1.5)

which requires a further $q^2$ multiplications and $q(q - 1)$ additions. Thus by exploiting the distributive law, we can reduce the total number of operations required to compute $\alpha(x_1)$ from $2q^4$ to $2q^3 - q$.
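As a quick numerical check of the factorization (1.3)–(1.5), here is a minimal Python sketch (ours, not from the paper) that fills small random tables for $f$ and $g$ and confirms that the factored computation of $\alpha(x_1)$ agrees with the direct evaluation of (1.1). The names q, A, f1, and g1 mirror the notation above; everything else is illustrative.

```python
import itertools
import random

q = 4                 # size of the alphabet A (kept small for illustration)
A = range(q)
random.seed(1)
f = {(x1, x2, x4): random.random() for x1, x2, x4 in itertools.product(A, A, A)}
g = {(x1, x3, x4): random.random() for x1, x3, x4 in itertools.product(A, A, A)}

# (1.1): marginalize x2, x3, x4 directly out of the product f*g.
alpha_direct = {x1: sum(f[x1, x2, x4] * g[x1, x3, x4]
                        for x2, x3, x4 in itertools.product(A, A, A))
                for x1 in A}

# (1.4): the intermediate tables f1 and g1.
f1 = {(x1, x4): sum(f[x1, x2, x4] for x2 in A) for x1 in A for x4 in A}
g1 = {(x1, x4): sum(g[x1, x3, x4] for x3 in A) for x1 in A for x4 in A}

# (1.5): the factored form implied by (1.3).
alpha_factored = {x1: sum(f1[x1, x4] * g1[x1, x4] for x4 in A) for x1 in A}

assert all(abs(alpha_direct[x1] - alpha_factored[x1]) < 1e-9 for x1 in A)
```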
Similarly, the distributive law tells us that (1.2) can be written as

    $\beta(x_4) = \sum_{x_1 \in A} f_1(x_1, x_4)\, g_1(x_1, x_4)$    (1.6)

where $f_1$ and $g_1$ are as defined in (1.4). Thus if we precompute tables of the values of $f_1$ and $g_1$ ($2q^2(q - 1)$ operations), and then use (1.6) ($2q^2 - q$ further operations), we only need $2q^3 - q$ operations (as compared to $2q^4$ for the direct method) to compute the values of $\beta(x_4)$.

Finally, we observe that to compute both $\alpha(x_1)$ and $\beta(x_4)$ using the simplifications afforded by (1.3) and (1.6), we only need to compute the tables $f_1$ and $g_1$ once, which means that we can compute the values of $\alpha(x_1)$ and $\beta(x_4)$ with a total of only $2q^3 + 2q^2 - 2q$ operations, as compared to $4q^4$ for the direct method.
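The savings are easy to tabulate. The following self-contained Python sketch (again ours, not from the paper) charges operations with the conventions used in the text (one multiplication and one addition per term for the direct method, $q - 1$ additions per length-$q$ sum when building the tables of (1.4)) and prints both totals for a few alphabet sizes; the helper names direct_ops and factored_ops are illustrative.

```python
def direct_ops(q):
    # Direct method: q outputs for alpha, q**3 terms each, one addition and
    # one multiplication charged per term; the same again for beta.
    return 2 * (2 * q**4)

def factored_ops(q):
    table_adds = 2 * q**2 * (q - 1)   # building the tables f1 and g1 of (1.4), once
    combine = q**2 + q * (q - 1)      # (1.5): q sums of q products each
    return table_adds + 2 * combine   # (1.5) for alpha and (1.6) for beta

for q in (2, 4, 16, 64):
    print(f"q={q:>2}  direct={direct_ops(q):>9}  factored={factored_ops(q):>7}")
```

For $q = 64$, for instance, this is roughly $5.3 \times 10^5$ operations in place of about $6.7 \times 10^7$.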
The simplification in Example 1.1 was easy to accomplish, and the gains were relatively modest. In more complicated cases, it can be much harder to see the best way to reorganize the calculations, but the computational savings can be dramatic. It is the object of this paper to show that problems of the type described in Example 1.1 have a wide range of applicability, and to describe a general procedure, which we call the generalized distributive law (GDL), for solving them efficiently. Roughly speaking, the GDL accomplishes its goal by passing messages in a communications network whose underlying graph is a tree.

Important special cases of the GDL have appeared many times previously. In this paper, for example, we will demonstrate that the GDL includes as special cases the fast Hadamard transform, Viterbi's algorithm, the BCJR algorithm, the Gallager–Tanner–Wiberg decoding algorithm (when the underlying graph is cycle-free), and certain "probability propagation" algorithms known in the artificial intelligence community. With a little more work, we could have added the FFT on any finite Abelian group, the Baum–Welch "forward-backward" algorithm, and discrete-state Kalman filtering. Although this paper contains relatively little that is essentially new (for example, the 1990 paper of Shafer and Shenoy [33] describes an algorithm similar to the one we present in Section III), we believe it is worthwhile to present a simply stated algorithm of such wide applicability, which gives a unified treatment of a great many algorithms whose relationship to each other was not fully understood, if sensed at all.

Here is an outline of the paper. In Section II, we will state a general computational problem we call the MPF ("marginalize a product function") problem, and show by example that a number of classical problems are instances of it. These problems include computing the discrete Hadamard transform, maximum-likelihood decoding of a linear code over a memoryless channel, probabilistic inference in Bayesian networks, a "probabilistic state machine" problem, and matrix chain multiplication. In Section III, we shall give an exact algorithm for solving the MPF problem (the GDL), which often yields an efficient solution. In Section IV we will discuss the problem of finding junction trees (the formal name for the GDL's communication network), and "solve" the example instances of the MPF problem given in Section II, thereby deriving, among other things, the fast Hadamard transform and Viterbi's algorithm. In Section V we will discuss the computational complexity of the GDL. (A proof of the correctness of the GDL is given in the Appendix.) In Section VI, we give a brief history of the GDL. Finally, in Section VII, we speculate on the possible existence of an efficient class of approximate, iterative algorithms for solving the MPF problem, obtained by allowing the communication networks to have cycles, and we discuss some recent theoretical work on GDL-like algorithms on graphs with a single cycle.

Although this paper is semitutorial, it contains a number of things which have not appeared previously. Besides the generality of our exposition, these include:

• A de-emphasis of a priori graphical models, and an emphasis on algorithms to construct graphical models to fit the given problem.
• A number of nonprobabilistic applications, including the FFT.
• A careful discussion of message scheduling, and a proof of the correctness of a large class of possible schedules.
• A precise measure of the computational complexity of the GDL.

Finally, we note that while this paper was being written, Kschischang, Frey, and Loeliger [41] were simultaneously and independently working out a similar synthesis. And while the final forms of the two papers have turned out to be quite different, anyone interested in the results of this paper should have a look at the alternative formulation in [41].

II. THE MPF PROBLEM

The GDL can greatly reduce the number of additions and multiplications required in a certain class of computational problems. It turns out that much of the power of the GDL is due to the fact that it applies to situations in which the notions of addition and multiplication are themselves generalized. The appropriate framework for this generalization is the commutative semiring.

Definition: A commutative semiring is a set $K$, together with two binary operations called "$+$" and "$\cdot$", which satisfy the following three axioms:

S1. The operation "$+$" is associative and commutative, and there is an additive identity element called "$0$" such that $a + 0 = a$ for all $a \in K$. (This axiom makes $(K, +)$ a commutative monoid.)

S2. The operation "$\cdot$" is also associative and commutative, and there is a multiplicative identity element called "$1$" such that $a \cdot 1 = a$ for all $a \in K$. (Thus $(K, \cdot)$ is also a commutative monoid.)

S3. The distributive law holds, i.e.,

    $(a \cdot b) + (a \cdot c) = a \cdot (b + c)$

for all triples $(a, b, c)$ from $K$.

The difference between a semiring and a ring is that in a semiring, additive inverses need not exist, i.e., $(K, +)$ is only required to be a monoid, not a group. Thus every commutative ring is automatically a commutative semiring. For example, the set of real or complex numbers, with ordinary addition and multiplication, forms a commutative semiring.
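To make the definition concrete, here is a small Python sketch (ours, not from the paper) of two commutative semirings: the ordinary sum-product semiring of the real numbers, and the min-sum semiring, in which "addition" is min and "multiplication" is ordinary addition. The Semiring class and the function axioms_hold are illustrative names; the assertions check the identity axioms S1 and S2 and the distributive law S3 at a few sample points.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class Semiring:
    add: Callable[[float, float], float]   # the "+" of axiom S1
    mul: Callable[[float, float], float]   # the "." of axiom S2
    zero: float                            # additive identity "0"
    one: float                             # multiplicative identity "1"

# Ordinary real arithmetic: a commutative ring, hence also a commutative semiring.
sum_product = Semiring(lambda a, b: a + b, lambda a, b: a * b, 0.0, 1.0)

# The min-sum semiring: additive inverses do not exist here, so it is not a ring.
min_sum = Semiring(min, lambda a, b: a + b, float("inf"), 0.0)

def axioms_hold(s: Semiring, a: float, b: float, c: float) -> bool:
    identities = s.add(a, s.zero) == a and s.mul(a, s.one) == a               # S1, S2
    distributive = s.add(s.mul(a, b), s.mul(a, c)) == s.mul(a, s.add(b, c))   # S3
    return identities and distributive

assert axioms_hold(sum_product, 2.0, 3.0, 4.0)
assert axioms_hold(min_sum, 2.0, 3.0, 4.0)
```

In the sum-product semiring, marginalizing a product is exactly the computation of Example 1.1; in the min-sum semiring it becomes minimizing a sum of costs, which is the setting in which Viterbi's algorithm appears as a special case in Section IV.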