Proceedings of the Twenty-Third International Joint Conference on Artificial Intelligence

The Inclusion-Exclusion Rule and Its Application to the Junction Tree Algorithm

David Smith and Vibhav Gogate
Department of Computer Science
The University of Texas at Dallas
Richardson, TX, 75080, USA
[email protected], [email protected]

Abstract

In this paper, we consider the inclusion-exclusion rule – a known yet seldom used rule of probabilistic inference. Unlike the widely used sum rule, which requires easy access to all joint probability values, the inclusion-exclusion rule requires easy access to several marginal probability values. We therefore develop a new representation of the joint distribution that is amenable to the inclusion-exclusion rule. We compare the relative strengths and weaknesses of the inclusion-exclusion rule with the sum rule and develop a hybrid rule called the inclusion-exclusion-sum (IES) rule, which combines their power. We apply the IES rule to junction trees, treating the latter as a target for knowledge compilation, and show that in many cases it greatly reduces the time required to answer queries. Our experiments demonstrate the power of our approach. In particular, at query time, on several networks, our new scheme was an order of magnitude faster than the junction tree algorithm.

1 Introduction

Popular probabilistic inference algorithms such as bucket elimination [Dechter, 1999], recursive conditioning [Darwiche, 2001] and belief propagation [Murphy et al., 1999; Yedidia et al., 2005] leverage only the sum and product rules of probability theory. In this paper, we consider the inclusion-exclusion rule, a known yet seldom used rule of probabilistic inference, and show that it presents several new opportunities for scaling up inference, specifically in the context of knowledge compilation [Darwiche and Marquis, 2002].

Knowledge compilation is a popular approach for tackling the intractability of probabilistic inference. The key idea is to compile the probabilistic graphical model (PGM) into a tractable structure such that several desired classes of queries can be answered in time that is polynomial (linear) in the size of the structure. The bulk of the computational overhead is thus pushed into the offline compilation phase, while queries can be answered quickly in an online manner. Tractable languages (e.g., junction or join trees [Lauritzen and Spiegelhalter, 1988; Dechter and Pearl, 1989], arithmetic circuits [Darwiche, 2003], AND/OR multi-valued decision diagrams [Mateescu et al., 2008], etc.) developed in the knowledge compilation community are also receiving increasing attention in the machine learning community. In particular, there has recently been a push to learn tractable models directly from data instead of first learning a PGM and then compiling it into a tractable model (cf. [Bach and Jordan, 2001; Poon and Domingos, 2011]).

In this paper, we focus on junction trees (trees of clusters of variables) – one of the oldest yet most widely used tractable subclasses of PGMs – and seek to improve their query complexity. One of the main drawbacks of using junction trees as a target for knowledge compilation is that the time required to process each cluster is exponential in the number of variables in the cluster. Thus, when the cluster size is large (e.g., a few gigabytes), even easy queries, such as computing the marginal distribution over a subset of variables in a cluster, can be prohibitively expensive. We envision that such large junction trees will be stored on external storage devices such as hard disk drives or solid state drives [Kask et al., 2010; Kyrola et al., 2012]. Since disk I/O is much slower than RAM I/O, we want to minimize it. The method described in this paper minimizes disk I/O in many cases by allowing us to safely ignore large portions of each cluster, which in turn reduces the query complexity.

We achieve this reduction in query complexity by considering alternate representations of potentials and developing efficient manipulation algorithms for these representations. Traditionally, each cluster potential in a junction tree is represented as a weighted truth table: each row in the table represents a truth assignment to all variables in the potential and is associated with a non-negative real number (weight). Although this representation enables efficient message-passing algorithms, which require taking the product of potentials and summing out variables from potentials, it has a significant disadvantage: inferring the probability of a conjunctive (probability of evidence) query defined over a subset of variables in a cluster is exponential in the number of variables not mentioned in the query, and therefore queries that mention only a few variables are computationally expensive.

To remedy this problem, we propose an alternate representation. Given a truth assignment to all variables in a potential, considered as a set of literals (e.g., the assignment A = 0, B = 1 is the set {¬a, b}), we represent the potential using the power set of this set of literals. Namely, we represent the potential by attaching a weight to each subset of the set of literals. We develop an alternate inference rule, based on the inclusion-exclusion principle, that uses this representation (we call it the inclusion-exclusion representation) to efficiently compute the probability of conjunctive queries. The main virtue of this new rule is that the probability of a conjunctive query can be computed in time that is exponential only in the number of literals in the query that are not present in the represented set of literals. Thus, when the number of query variables is small, or when most literals in the query are present in the represented set of literals, the inclusion-exclusion rule is exponentially more efficient than the conventional approach.

However, a key drawback of the inclusion-exclusion representation is that computing products of potentials represented in this format is very inefficient. As a result, we cannot use it by itself to represent potentials in junction trees, because the belief propagation algorithm used to answer queries over a junction tree requires computing products of potentials. We therefore develop a hybrid representation (and a hybrid inference rule) that combines the power of the conventional truth-table representation with the inclusion-exclusion representation, and we apply it to the junction tree algorithm. Our new algorithm takes the variables that are exclusive to a cluster and, instead of storing their assignments in a truth table, uses the inclusion-exclusion approach to represent their most likely assignment. As a result, our most common queries have smaller complexity than our least common queries.

We evaluated the efficacy of our new approach by comparing it with the junction tree algorithm on several randomly generated and benchmark PGMs. Our results show that in many cases, at query time, our new algorithm was significantly faster than the junction tree algorithm, clearly demonstrating its power and promise.

2 The Sum Rule

Notation. Let D denote a discrete probability distribution defined over a set X = {X_1, ..., X_n} of binary variables that take values from the domain {True, False}. (We assume binary variables for convenience; our method is general and can be easily extended to multi-valued variables.) A literal is a variable or its negation. Let φ be a function that attaches a non-negative real number (a weight) to each of the full assignments to all variables in X. Since each full assignment corresponds to a full conjunctive feature, we can also think of φ as a mapping between full conjunctive features and R+. For instance, a function over {A, B} can be specified using φ : {(¬a ∧ ¬b), (¬a ∧ b), (a ∧ ¬b), (a ∧ b)} → R+. We will use the term weight to describe the real number associated with each conjunctive feature. For the rest of the paper, when φ is specified using full conjunctive features, we will say that φ is specified using the sum-format.

The probability distribution associated with φ is P(f) = φ(f)/Z, where f is a full conjunctive feature and Z is the normalization constant, often called the partition function. Assuming that the partition function is known (computed offline and stored), we can infer the probability of any conjunctive query q using the sum rule of probability theory:

    Pr(q) = (1/Z) Σ_{f ∈ F(X \ Vars(q))} φ(q ∧ f)

where Vars(q) denotes the set of variables mentioned in q and F(X \ Vars(q)) is the set of full conjunctive features of X \ Vars(q). We assume that q is non-trivial in that no two literals in q mention the same variable.

Example 1. Fig. 1(a) shows how the sum rule can be used to compute the probability of various queries.

Proposition 1. Assuming that Z is known and the distribution is represented using full conjunctive features of X, the time complexity of computing the probability of a conjunctive query containing k literals is O(exp(|X| − k)).

Since the complexity of inference using the sum rule is exponential in |X| − k, marginal queries over a small number of variables (i.e., where k is small) are computationally expensive. In this paper, we argue that alternate descriptions of D can, in many cases, provide computationally cheaper alternatives.

3 The Inclusion-Exclusion (IE) Rule

Instead of describing φ in terms of weights attached to full conjunctive features over X, we can describe it in terms of weights attached to all possible positive conjunctive features over X (we say that a conjunctive feature is positive if none of its literals is negated). For example, to describe a probability distribution over the set of random variables