Empirical Comparison of Greedy Strategies for Learning Markov Networks of Treewidth k
K. Nunez, J. Chen and P. Chen, Dept. of Computer Science, LSU, Baton Rouge, LA 70803
G. Ding and R. F. Lax, Dept. of Mathematics, LSU, Baton Rouge, LA 70803
B. Marx, Dept. of Experimental Statistics, LSU, Baton Rouge, LA 70803

Abstract—We recently proposed the Edgewise Greedy Algorithm (EGA) for learning a decomposable Markov network of treewidth k approximating a given joint probability distribution of n discrete random variables. The main ingredient of our algorithm is the stepwise forward selection algorithm (FSA) due to Deshpande, Garofalakis, and Jordan. EGA is an efficient alternative to the algorithm (HGA) of Malvestuto, which constructs a model of treewidth k by selecting hyperedges of order k + 1. In this paper, we present results of empirical studies that compare HGA, EGA, and FSA-K, which is a straightforward application of FSA, in terms of approximation accuracy (measured by KL-divergence) and computational time. Our experiments show that (1) on average, all three algorithms produce similar approximation accuracy; (2) EGA produces comparable or better approximation accuracy and is the most efficient among the three; (3) Malvestuto's algorithm is the least efficient, although it tends to produce better accuracy when the treewidth is bigger than half the number of random variables; (4) EGA coupled with local search has the best approximation accuracy overall, at the cost of about 50 percent more computation time.

I. INTRODUCTION

A Markov network is an undirected graphical model of a joint probability distribution that captures conditional dependency among subsets of the random variables involved. Here, we consider only decomposable Markov networks (DMNs) of n discrete random variables. This means that the graphical model is a chordal (or triangulated) graph, and the joint probability distribution factors in a particularly simple way involving the product of the marginal distributions of all maximal cliques and the product of the marginal distributions of intersections of certain maximal cliques (see Lauritzen [19] and Theorem 1 below). DMNs have become increasingly popular in applications such as computer vision, natural language processing and information retrieval.

Recall that a graph is called chordal if every (simple) cycle of length at least four has a chord. In this paper, a k-clique is a complete subgraph on k vertices of a graph, and a maximal complete subgraph, which some authors call a clique, will be called a maximal clique.

The greedy algorithms compared in this paper tackle the problem of finding a chordal graph (DMN) of given treewidth k that best approximates (in some sense) a given probability distribution. Given that for chordal graphs the treewidth is simply one less than the maximum of the orders of the maximal cliques [5], a treewidth-k DMN approximation to a joint probability distribution of n random variables offers computational benefits, as the approximation uses products of marginal distributions each involving at most k + 1 of the random variables.

The treewidth-k DMN learning problem deals with a given empirical distribution P* (which could be given in the form of a distribution table or in the form of a set of training data), and the objective is to find a DMN (i.e., a chordal graph) G of treewidth k which defines a probability distribution P_µ that "best" approximates P* according to some criterion. For example, the criterion could be minimizing D_KL(P* || P_µ), where D_KL(p || q) is the Kullback-Leibler (KL) divergence measuring the difference between two probability distributions p and q. For the special case k = 1, a graph of treewidth one is a tree, and the problem of finding an optimal maximum-likelihood tree approximating a given probability distribution is solved by a well-known greedy algorithm due to Chow and Liu [6]. In general, finding the optimal graph G for k > 1 is computationally hard (see [13]); thus greedy algorithms have been developed that seek a good (e.g., with small KL-divergence to P*) DMN of treewidth k as an approximation to P*.
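Since the k = 1 case anchors everything that follows, a minimal sketch may help. The following Python snippet is our illustration, not the implementation used in the experiments; numpy/networkx, the function names, and the toy data are our assumptions. It estimates pairwise mutual information from a matrix of samples and returns a maximum-weight spanning tree, which is exactly the Chow-Liu construction.

```python
import itertools
from collections import Counter

import networkx as nx
import numpy as np


def empirical_mutual_information(x, y):
    """Empirical mutual information I(X; Y) of two columns of discrete samples."""
    n = len(x)
    pxy, px, py = Counter(zip(x, y)), Counter(x), Counter(y)
    # sum_{a,b} p(a,b) * log( p(a,b) / (p(a) p(b)) ), with p(a,b) = count / n
    return sum((c / n) * np.log(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())


def chow_liu_tree(data):
    """Maximum-likelihood treewidth-1 DMN (Chow-Liu): a maximum-weight
    spanning tree under pairwise empirical mutual information."""
    g = nx.Graph()
    g.add_nodes_from(range(data.shape[1]))
    for i, j in itertools.combinations(range(data.shape[1]), 2):
        g.add_edge(i, j,
                   weight=empirical_mutual_information(data[:, i], data[:, j]))
    return nx.maximum_spanning_tree(g)


# Toy usage: 500 samples of 4 binary variables where X0 -> X1 -> X2 is a
# noisy chain and X3 is independent noise.
rng = np.random.default_rng(0)
x0 = rng.integers(0, 2, 500)
x1 = (x0 ^ (rng.random(500) < 0.1)).astype(int)
x2 = (x1 ^ (rng.random(500) < 0.1)).astype(int)
x3 = rng.integers(0, 2, 500)
tree = chow_liu_tree(np.column_stack([x0, x1, x2, x3]))
print(sorted(tree.edges()))  # the chain edges (0, 1) and (1, 2) should appear
```

For k > 1 the same score-and-greedily-add pattern persists, but deciding whether an edge (or hyperedge) may be added becomes the hard part; this is where the algorithms compared below differ.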
Malvestuto [16] gave a greedy algorithm that constructs a so-called elementary DMN of treewidth k by adding at each step a hyperedge of order k + 1. His algorithm will be called the Hyperedge Greedy Algorithm (HGA). While HGA employs a very reasonable greedy strategy to find a good treewidth-k DMN, computationally HGA appears to be much less efficient because it needs to inspect all the cliques of size k + 1 in growing the chordal graph. In fact, the computational complexity of HGA is of the order $O\big(\binom{n}{k}\big)$. The Edgewise Greedy Algorithm (EGA) [10] proposed by the authors is a more efficient alternative to HGA. EGA's greedy strategy is to add one edge at a time based on maximum conditional mutual information. EGA modifies the stepwise forward selection algorithm (FSA) of Deshpande, Garofalakis and Jordan [9] by taking into account the treewidth of the model, creating chordal graphs of increasing treewidth to approximate P*. This algorithm can be seen as a generalization of the Chow-Liu algorithm. A local search method was also proposed in [10] to further improve the learned model. Using a straightforward application of the FSA of Deshpande, Garofalakis and Jordan [9], we can obtain an algorithm FSA-K which constructs a DMN of treewidth k by adding one edge at a time starting from the null graph, without having to first construct DMNs of treewidth 1, etc.

The EGA and FSA-K methods have the same computational complexity, as both utilize FSA as the core component. It has been shown [9] that at each edge-addition step, FSA takes O(n^2) time to enumerate all the eligible edges to be considered, and makes O(n) entropy computations. Since an edge-maximal treewidth-k chordal graph G with n vertices has exactly $nk - \binom{k+1}{2}$ edges, which is of the order nk, the time complexity of EGA and FSA-K is O([n^2 + nE] · nk), where E is the maximal amount of time needed to compute an entropy of k + 1 variables. This indicates that EGA and FSA-K are much more efficient than HGA. However, there is no empirical study comparing the approximation accuracies of HGA, EGA, and FSA-K, although two examples were shown in [10] indicating cases where EGA obtains the optimal solution whereas HGA does not.
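To make the edgewise strategy concrete, here is a simplified sketch (ours, not the code of [9] or [10]) of the greedy loop underlying FSA-K: repeatedly accept the highest-scoring edge that keeps the graph chordal with treewidth at most k. Two stated simplifications: edges are scored once by a fixed table, whereas EGA rescores candidates by conditional mutual information as the graph grows, and eligibility is re-checked from scratch with networkx rather than by maintaining the junction tree incrementally as FSA does.

```python
import itertools

import networkx as nx


def greedy_treewidth_k_graph(scores, n_vars, k):
    """FSA-K-like greedy loop: repeatedly add the highest-scoring edge that
    keeps the graph a chordal graph of treewidth at most k.

    `scores` maps each candidate edge (i, j) with i < j to a gain estimate,
    e.g. the empirical mutual information between variables i and j."""
    g = nx.Graph()
    g.add_nodes_from(range(n_vars))
    candidates = set(scores)
    while candidates:
        accepted = None
        for edge in sorted(candidates, key=scores.get, reverse=True):
            g.add_edge(*edge)
            if nx.is_chordal(g) and nx.chordal_graph_treewidth(g) <= k:
                accepted = edge        # eligible: keep the edge
                break
            g.remove_edge(*edge)       # ineligible: undo and try the next one
        if accepted is None:
            break                      # no remaining edge is eligible
        candidates.discard(accepted)
    return g


# Toy usage: 5 variables, k = 2; four edges are strongly favored.
scores = {e: 1.0 for e in itertools.combinations(range(5), 2)}
for e in [(0, 1), (0, 2), (1, 2), (1, 3)]:
    scores[e] = 5.0
model = greedy_treewidth_k_graph(scores, n_vars=5, k=2)
print(sorted(model.edges()), nx.chordal_graph_treewidth(model))
```

Rejected edges stay in the candidate pool because an edge that would break chordality at one step can become eligible after other edges are added.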
In this paper, we report a series of experiments that compare HGA, EGA, FSA-K, and EGA with local search. The comparisons are along two aspects: one is the approximation accuracy measured by D_KL(P* || P_µ), and the other is the computational time spent in constructing the DMNs.

The rest of the paper is organized as follows. Some necessary background information will be presented in Section II. The greedy algorithms HGA, EGA and FSA-K will be briefly described in Section III. Section IV contains the main part of the paper, namely the experiment settings, results and discussions. Section V concludes the paper.

II. BACKGROUND

The notion of a junction tree is useful in defining the probability distribution on a DMN G. It will be convenient to define a junction tree in terms of the clique graph, in the sense of [11], of a chordal graph.

Definition: Let G = (V, E) be a chordal graph. The clique graph of G, denoted C(G) = (V_c, E_c, µ), is the weighted graph defined by:
1) The vertex set V_c is the set of maximal cliques of G.
2) An edge (C_1, C_2) belongs to E_c if and only if the intersection C_1 ∩ C_2 is nonempty; the weight of this edge is µ(C_1, C_2) = |C_1 ∩ C_2|.

A junction tree of G is a maximum-weight spanning tree of C(G); the intersections C_1 ∩ C_2 along its edges form the multiset S of separators, which, together with the maximal cliques, yields the factorization of the joint probability distribution that comes from this decomposable graphical model.

Theorem 1: The joint probability distribution and the marginal distributions coming from this decomposable graphical model satisfy

$$p_X \cdot \prod_{S \in \mathcal{S}} p_S = \prod_{C \in V_c} p_C.$$

Definition: Let P*(X) be a probability distribution with n random variables X = {X_1, X_2, ..., X_n}. Let the chordal graph G be a DMN defining a probability distribution P_µ(X) approximating P*(X). Then P_µ(X) is given by

$$P_\mu(X) = \frac{\prod_{C \in V_c} P^*_C(X_C)}{\prod_{S \in \mathcal{S}} P^*_S(X_S)},$$

where P*_C(X_C) for a clique C in graph G is the marginal distribution from P* on the variables X_C in the clique C.

Recall that the entropy H(X) for a set of random variables X (following a distribution P) is defined to be

$$H(X) = -\sum_x P(x) \log P(x).$$

We sometimes denote the entropy H(X) by H_P(X) when it is necessary to distinguish the underlying probability distributions when computing the entropy. In this paper, when we use H(X) (or H(C)) without the subscript, we refer to the entropy of the variables under the empirical distribution P* or its marginal distributions.

The KL-divergence D_KL(p || q) between probability distributions p and q is defined to be

$$D_{KL}(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}.$$

Proposition 1: The difference between the entropy of a DMN G and the entropy of the "saturated" model (i.e., the DMN given by the complete graph on the n discrete random variables) equals the Kullback-Leibler divergence between P* and P_µ, namely,

$$D_{KL}(P^* \,\|\, P_\mu) = H_{P_\mu}(X) - H(X).$$

Thus, in finding the best (in terms of KL-divergence) DMN of treewidth k, it is necessary and sufficient to find a treewidth-k DMN that minimizes H_{P_µ}(X).

III. THREE GREEDY ALGORITHMS

A. Malvestuto's Algorithm

In Malvestuto's article, a chordal graph of treewidth k is called a model of rank k + 1.
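As a numerical illustration of Section II, the sketch below (ours; the toy distribution and variable names are assumptions) builds P_µ from the Definition for the chordal path X0 − X1 − X2, whose maximal cliques are {X0, X1} and {X1, X2} with separator {X1}, and verifies Proposition 1.

```python
import itertools

import numpy as np


def entropy(p):
    """Entropy H(X) = -sum_x p(x) log p(x) of a joint probability table p."""
    return -(p * np.log(p)).sum()


# An arbitrary strictly positive empirical distribution P* over three
# binary variables X0, X1, X2.
rng = np.random.default_rng(1)
p_star = rng.random((2, 2, 2))
p_star /= p_star.sum()

# Chordal path X0 - X1 - X2: maximal cliques {X0, X1} and {X1, X2},
# separator {X1} (the sets V_c and S of Theorem 1).
p01 = p_star.sum(axis=2)        # P*_{X0 X1}
p12 = p_star.sum(axis=0)        # P*_{X1 X2}
p1 = p_star.sum(axis=(0, 2))    # P*_{X1}

# P_mu = product of clique marginals / product of separator marginals.
p_mu = np.empty_like(p_star)
for x0, x1, x2 in itertools.product(range(2), repeat=3):
    p_mu[x0, x1, x2] = p01[x0, x1] * p12[x1, x2] / p1[x1]

kl = (p_star * np.log(p_star / p_mu)).sum()             # D_KL(P* || P_mu)
print(np.isclose(kl, entropy(p_mu) - entropy(p_star)))  # True, by Proposition 1
```

The identity holds because P_µ preserves the clique and separator marginals of P*, so the cross-entropy of P* relative to P_µ coincides with H_{P_µ}(X).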