Empirical Comparison of Greedy Strategies for Learning Markov Networks of Treewidth k

K. Nunez, J. Chen and P. Chen, G. Ding and R. F. Lax, B. Marx

Dept. of Computer Science, Dept. of Mathematics, and Dept. of Experimental Statistics, LSU, Baton Rouge, LA 70803

Abstract—We recently proposed the Edgewise Greedy Algorithm (EGA) for learning a decomposable Markov network of treewidth k approximating a given joint probability distribution of n discrete random variables. The main ingredient of our algorithm is the stepwise forward selection algorithm (FSA) due to Deshpande, Garofalakis, and Jordan. EGA is an efficient alternative to the algorithm (HGA) by Malvestuto, which constructs a model of treewidth k by selecting hyperedges of order k + 1. In this paper, we present results of empirical studies that compare HGA, EGA and FSA-K, which is a straightforward application of FSA, in terms of approximation accuracy (measured by KL-divergence) and computational time. Our experiments show that (1) on the average, all three algorithms produce similar approximation accuracy; (2) EGA produces comparable or better approximation accuracy and is the most efficient among the three; (3) Malvestuto's algorithm is the least efficient one, although it tends to produce better accuracy when the treewidth is bigger than half of the number of random variables; (4) EGA coupled with local search has the best approximation accuracy overall, at a cost of increased computation time by 50 percent.

I. INTRODUCTION

A Markov network is an undirected graphical model of a joint probability distribution that captures conditional dependency among subsets of the random variables involved. Here, we consider only decomposable Markov networks (DMNs) of n discrete random variables. This means that the graphical model is a chordal (or triangulated) graph, and the joint probability distribution factors in a particularly simple way involving the product of the marginal distributions of all maximal cliques and the product of the marginal distributions of intersections of certain maximal cliques (see Lauritzen [14] and Theorem 1 below). DMNs have become increasingly popular in applications such as computer vision, natural language processing and information retrieval.

Recall that a graph is called chordal if every (simple) cycle of length at least four has a chord. In this paper, a k-clique is a complete subgraph on k vertices of a graph, and a maximal complete subgraph, which some authors call a clique, will be called a maximal clique.

The greedy algorithms compared in this paper tackle the problem of finding a chordal graph (DMN) of given treewidth k that best approximates (in some sense) a given probability distribution. Given that for chordal graphs the treewidth is simply one less than the maximum of the orders of the maximal cliques [5], a treewidth k DMN approximation to a joint probability distribution of n random variables offers computational benefits, as the approximation uses products of marginal distributions each involving at most k + 1 of the random variables.

The treewidth k DMN learning problem deals with a given empirical distribution P* (which could be given in the form of a distribution table or in the form of a set of training data), and the objective is to find a DMN (i.e., a chordal graph) G of treewidth k which defines a probability distribution Pµ that "best" approximates P* according to some criterion. For example, the criterion could be minimizing DKL(P*||Pµ), where DKL(p||q) is the Kullback-Leibler (KL) divergence measuring the difference between two probability distributions p and q. For the special case k = 1, a graph of treewidth one is a tree. The problem of finding an optimal maximum likelihood tree approximating a given probability distribution is solved by a well-known greedy algorithm due to Chow and Liu [6]. In general, finding the optimal graph G for k > 1 is computationally hard (see [13]), thus greedy algorithms have been developed that seek a good (e.g., with small KL-divergence to P*) DMN of treewidth k as an approximation to P*.
Malvestuto [16] gave a greedy algorithm that constructs a so-called elementary DMN of treewidth k by adding at each step a hyperedge of order k + 1. His algorithm will be called the Hyperedge Greedy Algorithm (HGA). While HGA employs a very reasonable greedy strategy to find a good treewidth k DMN, computationally HGA appears to be much less efficient because it needs to inspect all the cliques of size k + 1 in growing the chordal graph. In fact, the computational complexity of HGA is of the order $O\big(\binom{n}{k}\big)$.

The Edgewise Greedy Algorithm (EGA) [10] proposed by the authors is a more efficient alternative to HGA. The EGA's greedy strategy is adding one edge at a time based on maximum conditional mutual information. The EGA modifies the stepwise forward selection algorithm (FSA) of Deshpande-Garofalakis-Jordan [9] by taking into account the treewidth of the model, creating chordal graphs of increasing treewidth to approximate P*. This algorithm can be seen as a generalization of the Chow-Liu algorithm. A local search method was also proposed in [10] to further improve the learned model. Using a straightforward application of the FSA by Deshpande-Garofalakis-Jordan [9], we can obtain an algorithm FSA-K which constructs a DMN of treewidth k by adding one edge at a time from the null graph, without having to first construct DMNs of treewidth 1, etc. The EGA and FSA-K methods have the same computational complexity, as both utilize FSA as the core component.

It has been shown [9] that at each edge-addition step, FSA takes $O(n^2)$ time to enumerate all the eligible edges to be considered, and makes $O(n)$ entropy computations. Since an edge-maximal treewidth k chordal graph G with n vertices has exactly $nk - \binom{k+1}{2}$ edges, which is of the order nk, the time complexity of EGA and FSA-K is $O([n^2 + nE]\,nk)$, where E is the maximal amount of time needed to compute an entropy of k + 1 variables. This indicates that EGA and FSA-K are much more efficient than HGA. However, there is no empirical study comparing the approximation accuracies of HGA, EGA, and FSA-K, although two examples were shown in [10] indicating cases where EGA obtains the optimal solution whereas HGA does not.
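As a side note (our own verification, not a claim from the paper), the edge count used in this bound follows from the standard construction of an edge-maximal treewidth k chordal graph, i.e., a k-tree (cf. [20]): start from a clique on k + 1 vertices and join each of the remaining n - k - 1 vertices to an existing k-clique, giving

$$|E(G)| = \binom{k+1}{2} + (n-k-1)\,k = nk - \binom{k+1}{2}.$$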
In this paper, we report a series of experiments that compare HGA, EGA, FSA-K and EGA with local search. The comparisons are along two aspects: one is the approximation accuracy measured by DKL(P*||Pµ), and the other is the computational time spent in constructing the DMNs.

The rest of the paper is organized as follows. Some necessary background information will be presented in Section II. The greedy algorithms HGA, EGA and FSA-K will be briefly described in Section III. Section IV contains the main part of the paper, namely, the experiment settings, results and discussions. Section V concludes the paper.

II. BACKGROUND

The notion of a junction tree is useful in defining the probability distribution on a DMN G. It will be convenient to define a junction tree in terms of the clique graph, in the sense of [11], of a chordal graph.

Definition: Let G = (V, E) be a chordal graph. The clique graph of G, denoted C(G) = (Vc, Ec, µ), is the weighted graph defined by:
1) The set Vc is the set of maximal cliques of G.
2) An edge (C1, C2) belongs to Ec if and only if the intersection C1 ∩ C2 in G is a minimal a, b-separator for each a ∈ C1 \ C2 and each b ∈ C2 \ C1.
3) Each edge (Ci, Cj) ∈ Ec is given the weight µ(Ci, Cj) = |Ci ∩ Cj|.

Definition: Let G = (V, E) be a chordal graph. A junction tree of G is a maximum-weight spanning tree of the clique graph of G.
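As an illustration of these two definitions (our own sketch, not code from the paper), a junction tree of a small chordal graph can be built with the networkx library by forming the clique graph with weights |Ci ∩ Cj| and taking a maximum-weight spanning tree; the example graph is hypothetical.

import itertools
import networkx as nx

def junction_tree(G):
    """Return a maximum-weight spanning tree of the clique graph of chordal G."""
    assert nx.is_chordal(G)
    cliques = [frozenset(c) for c in nx.find_cliques(G)]  # the maximal cliques V_c
    CG = nx.Graph()
    CG.add_nodes_from(cliques)
    for C1, C2 in itertools.combinations(cliques, 2):
        sep = C1 & C2
        if sep:  # weight mu(C_i, C_j) = |C_i ∩ C_j|
            CG.add_edge(C1, C2, weight=len(sep))
    return nx.maximum_spanning_tree(CG, weight="weight")

# Example: a 4-cycle with one chord is chordal; its two maximal cliques
# {1, 2, 3} and {1, 3, 4} share a separator of size 2.
G = nx.Graph([(1, 2), (2, 3), (3, 4), (4, 1), (1, 3)])
print(list(junction_tree(G).edges(data=True)))

For simplicity the sketch connects every pair of intersecting maximal cliques rather than checking the minimal-separator condition of the definition; a maximum-weight spanning tree of this larger clique graph is still a junction tree.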
Let the chordal graph G be a DMN for the probability distribution pX, where X = {X1, X2, ..., Xn}. Let T = (Vc, Et) be a junction tree for G and let S denote the multiset {Ci ∩ Cj | (Ci, Cj) ∈ Et}. The next result gives the factorization of the joint probability distribution that comes from this decomposable graphical model.

Theorem 1: The joint probability distribution and the marginal distributions coming from this decomposable graphical model satisfy

$$p_X \cdot \prod_{S \in \mathcal{S}} p_S \;=\; \prod_{C \in V_c} p_C.$$

Definition: Let P*(X) be a probability distribution with n random variables X = {X1, X2, ..., Xn}. Let the chordal graph G be a DMN defining a probability distribution Pµ(X) approximating P*(X). Then Pµ(X) is given by

$$P_\mu(X) = \frac{\prod_{C \in V_c} P^*_C(X_C)}{\prod_{S \in \mathcal{S}} P^*_S(X_S)},$$

where P*_C(X_C) for a clique C in graph G is the marginal distribution from P* on the variables X_C in the clique C.
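To make the definition concrete, here is a minimal sketch (our own, with an assumed table layout that is not from the paper) that evaluates Pµ(x) as the product of clique marginals divided by the product of separator marginals.

def p_mu(x, clique_marginals, separator_marginals):
    """x: dict mapping variable name -> value.
    clique_marginals, separator_marginals: lists of (variables, table) pairs,
    where table maps a tuple of values (ordered as in variables) to a probability."""
    num, den = 1.0, 1.0
    for variables, table in clique_marginals:
        num *= table[tuple(x[v] for v in variables)]
    for variables, table in separator_marginals:
        den *= table[tuple(x[v] for v in variables)]
    return num / den if den > 0 else 0.0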
Recall that the entropy H(X) for a set of random variables X (following a distribution P) is defined to be $H(X) = -\sum_{x} P(x) \log P(x)$. We sometimes denote the entropy H(X) by H_P(X) when it is necessary to distinguish the underlying probability distributions when computing the entropy. In this paper, when we just use H(X) (or H(C)) without the subscript, we refer to the entropy of the variables under the empirical distribution P* or its marginal distributions.

The KL-divergence DKL(p||q) between probability distributions p and q is defined to be $D_{KL}(p\|q) = \sum_{x} p(x) \log \frac{p(x)}{q(x)}$.

Proposition 1: The difference between the entropy of a DMN G and the entropy of the "saturated" model (i.e., the DMN given by the complete graph on the n discrete random variables) equals the Kullback-Leibler divergence between P* and Pµ, namely,

$$D_{KL}(P^* \| P_\mu) = H_{P_\mu}(X) - H(X).$$

Thus in finding the best (in terms of KL-divergence) DMN of treewidth k, it is necessary and sufficient to find a treewidth k DMN that minimizes H_{Pµ}(X).
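For small distribution tables these quantities are easy to compute directly; the helper functions below (our own illustration, with distributions stored as dictionaries from outcomes to probabilities) simply restate the two definitions above.

import math

def entropy(p):
    """H(X) = -sum_x p(x) log p(x), ignoring zero-probability outcomes."""
    return -sum(px * math.log(px) for px in p.values() if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q); assumes q(x) > 0 wherever p(x) > 0."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)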
III. THREE GREEDY ALGORITHMS

A. Malvestuto's Algorithm

In Malvestuto's article, a chordal graph of treewidth k is called a model of rank k + 1. He constructs what he calls an elementary uniform model of rank k + 1. Here, "uniform" means that each maximal clique has order k + 1. The term "elementary" is defined in his paper but we omit the definition due to lack of space in this paper.

Below is a brief description of Malvestuto's Hyperedge Greedy Algorithm (HGA) for constructing a good DMN of treewidth k, given a joint probability distribution for discrete random variables X1, X2, ..., Xn.

Step 1: Among all subsets of {X1, ..., Xn} of order k + 1, select the set C1 that minimizes the marginal entropy H(C1).
Step 2: Suppose cliques C1, ..., Ch, each of order k + 1, have already been chosen. Consider all subsets C of {X1, ..., Xn} of order k + 1 such that the chordal graph G_C with maximal cliques C1, ..., Ch, C is elementary. In the associated rooted junction tree of G_C, let C0 denote the parent of C and put S_C = C0 ∩ C. From these eligible subsets C, choose one that minimizes the quantity H(C) - H(S_C).
Step 3: Stop when all the Xi appear in some clique. (By the definition of elementary, this will occur after n - (k+1) + 1 = n - k cliques of order k + 1 have been selected.)
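As an illustration only (our own sketch, not Malvestuto's implementation), the greedy choices in Steps 1 and 2 can be written as follows, with the marginal entropy oracle entropy_of, the eligibility test is_eligible (standing in for the "elementary" condition, whose definition is omitted above) and parent_separator treated as assumed black boxes.

import itertools

def hga_first_clique(variables, k, entropy_of):
    # Step 1: the (k+1)-subset with the smallest marginal entropy H(C).
    return min(itertools.combinations(variables, k + 1), key=entropy_of)

def hga_next_clique(variables, chosen, k, entropy_of, is_eligible, parent_separator):
    # Step 2: among eligible (k+1)-subsets C, minimize H(C) - H(S_C), where
    # S_C is the intersection of C with its parent in the rooted junction tree.
    best, best_score = None, float("inf")
    for C in itertools.combinations(variables, k + 1):
        if not is_eligible(C, chosen):
            continue
        score = entropy_of(C) - entropy_of(parent_separator(C, chosen))
        if score < best_score:
            best, best_score = C, score
    return best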
B. The Edgewise Greedy Algorithm

The Edgewise Greedy Algorithm (EGA) proposed in [10] also constructs a good DMN of treewidth k, but proceeds by adding one edge at a time to the model. In considering which edge to add, we consider edges that will result in a chordal graph of a given treewidth, and from these we choose one that results in a maximum decrease in entropy. We begin with the graph with n vertices and no edges and recursively construct chordal graphs of treewidth 1, 2, ..., k.

Definition (k-admissible edges): Let G be a chordal graph and let a and b be vertices that are not adjacent in G. We say that the edge (a, b) is k-admissible if the graph G' obtained by adding (a, b) to G is chordal of treewidth at most k.

Deshpande-Garofalakis-Jordan [9] gave an algorithm for efficiently identifying those edges that can be added to the current chordal graph to result in a chordal graph, which we will refer to as the Forward Selection Algorithm (FSA). The FSA is modified and used in the EGA algorithm to find the set of r-admissible edges for a given r ≤ k.
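The FSA identifies such addable edges incrementally; as a much slower but self-contained illustration of the k-admissibility definition (our own sketch, not the FSA), one can simply test a candidate edge directly with networkx, using the fact that a chordal graph has treewidth at most k exactly when every maximal clique has order at most k + 1.

import networkx as nx

def is_k_admissible(G, a, b, k):
    """Naive check: adding (a, b) must keep G chordal with treewidth <= k."""
    H = G.copy()
    H.add_edge(a, b)
    if not nx.is_chordal(H):
        return False
    return max(len(c) for c in nx.find_cliques(H)) <= k + 1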
Once the r-admissible edges (r ≤ k) are enumerated, we need to choose among them the best one to add to the current graph in building a treewidth r chordal graph. This is done by using conditional mutual information as a selection criterion function. Namely, given the current graph G and the set Er of admissible edges, one should choose an edge (Xi, Xj) from Er which has maximal conditional mutual information I(Xi; Xj|Sij) among all edges in Er, where Sij is the minimal separator of the vertices Xi and Xj. For sets X, Y, Z of random variables, the conditional mutual information of X and Y given Z, denoted I(X; Y|Z), is defined by I(X; Y|Z) = H(X|Z) - H(X|Y, Z), where H(X|Y) = H(X, Y) - H(Y).

The EGA Algorithm
Input: The joint (empirical) probability distribution of n discrete random variables X1, X2, ..., Xn and a positive integer k < n.
Output: A DMN (chordal graph) with n vertices and treewidth k.
Initialize: Let G be the graph with vertices X1, ..., Xn and no edges. Put r = 1.
Step 1: Use the FSA (modified to take into account the order of the potential new clique) to enumerate all r-admissible edges. If there are no r-admissible edges, proceed to Step 4.
Step 2: For each r-admissible edge (Xi, Xj), compute I(Xi; Xj|Sij) and add an edge with the maximum value.
Step 3: Perform the necessary updating in the FSA and go to Step 1.
Step 4: If r = k, stop; else, increment r by 1 and proceed to Step 1.

A local search algorithm was also proposed in [10] to further reduce the entropy of the learned graph. The local search method looks for vertices x, y and w in G such that the edge (x, w) is in the current graph while the edge (x, y) is not. The method tries to see whether exchanging such edges would reduce the entropy of the resulting graph; if so, the exchange is retained and the local search continues until no more such edge exchange is possible. More details about the local search algorithm can be found in [10].
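Step 2 above scores each candidate edge by the conditional mutual information I(Xi; Xj|Sij). Below is a small sketch of how this quantity can be computed from a full joint table (the data layout and helper names are our own assumptions); it uses the identity I(X; Y|Z) = H(X, Z) + H(Y, Z) - H(Z) - H(X, Y, Z), which follows from the definition above.

import math
from collections import defaultdict

def marginal_entropy(joint, names, subset):
    """joint: dict mapping full assignment tuples (ordered as in names) to probabilities."""
    idx = [names.index(v) for v in subset]
    marg = defaultdict(float)
    for assignment, p in joint.items():
        marg[tuple(assignment[i] for i in idx)] += p
    return -sum(p * math.log(p) for p in marg.values() if p > 0)

def conditional_mi(joint, names, X, Y, Z):
    """I(X; Y | Z) for disjoint lists of variable names X, Y, Z."""
    H = lambda s: marginal_entropy(joint, names, s)
    return H(X + Z) + H(Y + Z) - H(Z) - H(X + Y + Z)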
C. The FSA-K Algorithm

By a straightforward application of the FSA algorithm, we can obtain another greedy algorithm, FSA-K, for constructing a DMN of treewidth k approximating a given distribution P*. This algorithm adds one edge at a time starting from the null graph. It calls FSA to find all k-admissible edges and then adds one with the maximal conditional mutual information. It then performs the updates to the clique graph in the same way as EGA does. This process keeps going until no more k-admissible edges are available.

IV. EXPERIMENTS

A. Experiment Design

A program was written to implement each algorithm and experiments were conducted to compare the performance of each algorithm from data attained by the program. The experiments compared two attributes of each algorithm: accuracy and efficiency. Accuracy was measured by KL-divergence and efficiency was measured by the execution time of the program while the algorithm was executing. Each algorithm has two inputs: the number of variables in the joint probability distribution, n, and the treewidth of the DMN, k. The algorithm creates a DMN and returns either the execution time or the KL-divergence. For each fixed input parameter setting, multiple runs of each algorithm are executed, and the results are averaged over the multiple runs (see below; the number of iterations i described in the next two paragraphs records the number of runs). The experiments that measured KL-divergence and the experiments that measured execution time differ slightly.

The experiments that measure the accuracy of each algorithm compute an average KL-divergence (over multiple runs) for each treewidth. These experiments hold n constant and use k as the independent variable. These experiments were conducted as follows. Let i be the number of iterations the algorithms execute before an average is computed. The program will generate a random joint probability distribution of binary variables, P*. Using this distribution each algorithm will construct a DMN. A new joint probability distribution, Pµ, is attained from the DMN and the KL-divergence, DKL(P*||Pµ), is calculated. The process of creating random joint probability distributions, constructing a DMN for each algorithm, and computing KL-divergence continues for i iterations. Next, for each k value, the average (over the i runs) KL-divergence for each algorithm is computed. Once this is done, the experiment is complete.

The experiments that measure the efficiency of each algorithm compute the average execution time (over multiple runs) for each value of n. These experiments were conducted as follows. The algorithm to test is selected. Let i be the number of iterations the algorithm executes before an average is computed. The program will generate a random joint probability distribution of binary variables and the algorithm will construct a DMN from this distribution. The execution time of the algorithm will be recorded. This occurs for i iterations and then an average execution time for the algorithm is computed. Once the average execution time is computed for each value of n, another algorithm is selected. When all necessary algorithms are selected, the experiment is complete.

A joint probability distribution is generated by constructing a table of all possible vectors and assigning each one a weight. Only binary variables will be in the distribution, thus there will be 2^n vectors. Let r_i be a pseudo-random number between 1 and 4 that corresponds to a vector X_i. The weight of X_i is determined by computing 2^{r_i}. The weights of all vectors are normalized and a joint probability distribution is created as a result.
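Read literally, the generator described above can be sketched as follows (our own rendering; the function name and the use of Python's random module are assumptions):

import itertools
import random

def random_joint_distribution(n, seed=None):
    """Give each of the 2^n binary vectors weight 2^r, r uniform in {1,...,4}, then normalize."""
    rng = random.Random(seed)
    vectors = list(itertools.product((0, 1), repeat=n))
    weights = [2 ** rng.randint(1, 4) for _ in vectors]
    total = float(sum(weights))
    return {v: w / total for v, w in zip(vectors, weights)}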
B. Experiment Results

The graphs in this section display the results of the different experiments conducted to compare either the accuracy or the efficiency of the algorithms. The algorithm EGA refers to the Edgewise Greedy Algorithm without the local search procedure, and EGA+L refers to the Edgewise Greedy Algorithm with the local search procedure. Comparing these two algorithms provides information on the effects of the local search.

Figure 1 compares the accuracy of EGA, FSA-K, and HGA. In this experiment n is 8 and i is 100. EGA and FSA-K have similar accuracies, but notice that EGA performs slightly better than FSA-K overall. HGA has less accuracy for treewidths 1 and 2 in comparison to the other two algorithms, but has greater accuracy for treewidths 3, 4, and 5.

Figure 2 compares the efficiency of the three algorithms; k is 2 and i is 10. The data shows that on average the EGA executed the fastest of the three algorithms. The FSA-K required a similar amount of time, and the HGA performed the slowest. The EGA adds a restriction at each step of the FSA-K: a new edge cannot be added if it creates a clique with a size greater than the current limit. This restriction reduces the number of edges eligible for addition to the graph. Thus, there will be fewer calculations when determining which edge to add to the graph. The restriction, however, also increases the execution time. The experiment indicates that the calculations for restricting edges cost less execution time than the calculations for considering those edges that would not have been eligible. This is a reason that the EGA required less execution time than the FSA-K. HGA's execution time grows at a much faster rate than that of the other two algorithms. In the HGA, all sets of size k + 1 must be generated from a set of size n; thus, its execution time is expected to grow at a faster rate as n increases than the execution time of the EGA and FSA-K. Each algorithm's execution time grows at an exponential rate because the program requires each algorithm to calculate entropy from a probability table whose size doubles with every increase in n. If the empirical distribution P* is given in the form of a training data set instead of a probability distribution table, then this exponential growth with n will not be observed.


Figures 3 and 4 compare the accuracy and the efficiency, respectively, of the Edgewise Greedy Algorithm without the local search procedure, EGA, and the Edgewise Greedy Algorithm with the local search procedure, EGA+L; here k is 7 and i is 4. The data shows that the local search procedure improved the accuracy of the EGA, and that it is more effective for higher treewidths than for lower treewidths. This increase in accuracy comes with a cost in efficiency: Figure 4 shows that the EGA+L performed slower. The rate, however, at which both algorithms' execution times grow as n increases appears to be the same: the local search procedure appears to increase the execution time of the EGA on average by 50%.

Figure 5 compares the accuracy of the EGA+L and the HGA; n is 8 and i is 100. The accuracies of the two appear very similar, but the data shows that EGA+L performed slightly better. The advantage of EGA+L is conveyed in Figure 6, which shows the frequency with which each algorithm generated a DMN with better accuracy than the other algorithm. It is clear that the EGA+L will more likely generate a model with better accuracy, especially when the treewidth is low. In addition, the EGA+L requires less time to execute on average than the HGA; Figure 4 shows that the EGA+L requires about 50% more execution time than EGA, and Figure 2 shows that HGA is significantly less efficient than EGA. The EGA+L is thus both more accurate and more efficient than the HGA.

V. CONCLUSION

We presented experimental studies comparing greedy algorithms for learning decomposable Markov networks of treewidth k. The algorithms compared are Malvestuto's Hyperedge Greedy Algorithm [16], the Edgewise Greedy Algorithm [10] by the authors, the FSA-K algorithm derived from the stepwise Forward Selection Algorithm (FSA) by Deshpande, Garofalakis, and Jordan [9], and EGA with local search.

The experiments conducted show how each algorithm performs with respect to accuracy, measured by the KL-divergence, and to time. The accuracies of the EGA, HGA, and FSA-K are all similar. The EGA is the most efficient algorithm, with comparable or better accuracy. HGA is the least efficient algorithm; it is on average more accurate than the other two algorithms for treewidths higher than half the number of random variables, but it is less accurate otherwise. The EGA with the local search procedure produces the best approximation accuracy overall and requires on average about 50% more execution time than the EGA without the local search procedure.

One limitation of the studies reported here is that the experiments only provide information for randomly generated probability distributions. It would be interesting to compare the algorithms on more probability distributions from meaningful real world applications. Moreover, some algorithms may perform better under different conditions.

Acknowledgements: This research was partially supported by NSF grant ITR-0326387 and AFOSR grants FA9550-05-1-0454, F49620-03-1-0238, F49620-03-1-0239 and F49620-03-1-0241.

REFERENCES

[1] S. M. Altmueller and R. M. Haralick (2004). Approximating high dimensional probability distributions. In Proceedings of the 17th International Conference on Pattern Recognition (ICPR'04), 299–302.
[2] S. M. Altmueller and R. M. Haralick (2004b). Practical aspects of efficient forward selection in decomposable graphical models. In Proceedings of the 16th IEEE International Conference on Tools with Artificial Intelligence (ICTAI'04), 710–715.
[3] F. R. Bach and M. I. Jordan (2002). Thin junction trees. In Advances in Neural Information Processing Systems (NIPS 2001), Vol. 14, 569–576.
[4] J. R. S. Blair and B. W. Peyton (1993). An introduction to chordal graphs and clique trees. In Graph Theory and Sparse Matrix Computation, edited by A. George, J. R. Gilbert, and J. W. H. Liu. The IMA Volumes in Mathematics and its Applications, Vol. 56, 1–29. New York: Springer.
[5] H. L. Bodlaender (1998). A partial k-arboretum of graphs with bounded treewidth. Theor. Comp. Sci. 209, 1–45.
[6] C. Chow and C. Liu (1968). Approximating discrete probability distributions with dependence trees. IEEE Trans. Inform. Theory IT-14(3), 462–467.
[7] T. M. Cover and J. A. Thomas (1991). Elements of Information Theory. New York: John Wiley & Sons.
[8] G. T. Deane (2002). Graphical models for high dimensional density estimation. B. S. Honours thesis, Australian National University. citeseer.ist.psu.edu/562940.html
[9] A. Deshpande, M. Garofalakis, and M. I. Jordan (2001). Efficient stepwise selection in decomposable models. In Proceedings of the 17th Annual Conference on Uncertainty in Artificial Intelligence (UAI-01), 128–135. San Francisco, CA: Morgan Kaufmann.
[10] G. Ding, R. Lax, J. Chen, P. Chen, and B. Marx (2007). Comparison of greedy strategies for learning Markov networks of tree width k. In Proceedings of the 2007 Int. Conference on Machine Learning: Models, Technologies and Applications (MLATA07), Las Vegas, 294–300.
[11] P. Galinier, M. Habib, and C. Paul (1995). Chordal graphs and their clique graphs. In Graph-Theoretic Concepts in Computer Science, 21st International Workshop, WG '95, Aachen, Germany, 358–371. Berlin: Springer.
[12] P. Giudici and P. J. Green (1999). Decomposable graphical Gaussian model determination. Biometrika 86(4), 785–801.
[13] D. Karger and N. Srebro (2001). Learning Markov networks: maximum bounded tree-width graphs. In Proceedings of the 12th Annual ACM-SIAM Symposium on Discrete Algorithms, 391–401.
[14] S. L. Lauritzen (1996). Graphical Models. Oxford: Oxford University Press.
[15] P. Liang and N. Srebro (2004). Methods and experiments with bounded tree-width Markov networks. MIT technical report. ftp://publications.ai.mit.edu/ai-publications/2004/AIM-2004-030.pdf
[16] F. Malvestuto (1991). Approximating discrete probability distributions with decomposable models. IEEE Transactions on Systems, Man, and Cybernetics 21(5), 1287–1294.
[17] M. Narasimhan and J. Bilmes (2004). PAC-learning bounded tree-width graphical models. In Proceedings of the 20th Conference on Uncertainty in Artificial Intelligence (UAI-04), 410–417. ACM International Conference Proceeding Series, Vol. 70.
[18] M. Norušis (2004). SPSS 13.0 Advanced Statistical Procedures Companion. Upper Saddle River, NJ: Prentice Hall.
[19] J. Pearl (1988). Probabilistic Reasoning in Intelligent Systems. San Mateo, CA: Morgan Kaufmann.
[20] D. J. Rose (1974). On simple characterizations of k-trees. Discrete Mathematics 7, 317–322.
[21] Y. Xiang, S. K. M. Wong, and N. Cercone (1997). A "microscopic" study of minimum entropy search in learning decomposable Markov networks. Machine Learning 26(1), 65–92.