Learning Markov Network Structure with Decision Trees
Daniel Lowd* and Jesse Davis†
*Department of Computer and Information Science, University of Oregon, Eugene, OR 97403. Email: [email protected]
†Department of Computer Science, Katholieke Universiteit Leuven, Celestijnenlaan 200A, POBox 2402, 3001 Heverlee, Belgium. Email: [email protected]

Abstract—Traditional Markov network structure learning algorithms perform a search for globally useful features. However, these algorithms are often slow and prone to finding local optima due to the large space of possible structures. Ravikumar et al. [1] recently proposed the alternative idea of applying L1 logistic regression to learn a set of pairwise features for each variable, which are then combined into a global model. This paper presents the DTSL algorithm, which uses probabilistic decision trees as the local model. Our approach has two significant advantages: it is more efficient, and it is able to discover features that capture more complex interactions among the variables. Our approach can also be seen as a method for converting a dependency network into a consistent probabilistic model. In an extensive empirical evaluation on 13 datasets, our algorithm obtains comparable accuracy to three standard structure learning algorithms while running 1-4 orders of magnitude faster.

Keywords-Markov networks; structure learning; decision trees; probabilistic methods

I. INTRODUCTION

Markov networks are an undirected, probabilistic graphical model for compactly representing a joint probability distribution over a set of variables. Traditional Markov network structure learning algorithms perform a global search to learn a set of features that accurately captures high-probability regions of the instance space. These approaches are often slow for two reasons. First, the space of possible structures is exponential in the number of variables. Second, evaluating candidate structures requires assigning a weight to each feature in the model. Weight learning requires performing inference over the model, which is often intractable.

Recently, Ravikumar et al. [1] proposed the alternative idea of learning a local model for each variable and then combining these models into a global model. Their method builds an L1 logistic regression model to predict the value of each variable in terms of all other variables. Next, it constructs a pairwise feature between the target variable and each other variable with non-zero weight in the L1 logistic regression model. Finally, it adds all constructed features to the model and learns their associated weights using any standard weight learning algorithm. While this approach greatly improves the tractability of structure learning, it is limited to modeling pairwise interactions, ignoring all higher-order effects. Furthermore, it still exhibits long run times for domains that have large numbers of variables.

In this paper, we propose DTSL (Decision Tree Structure Learner), which builds on the approach of Ravikumar et al. by substituting a probabilistic decision tree learner for L1 logistic regression. Probabilistic decision trees can represent much richer structures that model interactions among large sets of variables. DTSL learns probabilistic decision trees to predict the value of each variable and then converts the trees into sets of conjunctive features. We propose and evaluate several different methods for performing the conversion. Finally, DTSL merges all learned features into a global model. Weights for these features can be learned using any standard Markov network weight learning method.

We conducted an extensive empirical evaluation on 13 real-world datasets. We found that DTSL is 1-4 orders of magnitude faster than alternative structure learning algorithms while still achieving equivalent accuracy.

The remainder of our paper is organized as follows. Section 2 provides background on Markov networks. Section 3 describes our method for learning Markov networks using decision trees. Section 4 presents the experimental results and analysis, and Section 5 contains conclusions and future work.

II. MARKOV NETWORKS

A. Representation

A Markov network is a model for the joint distribution of a set of variables X = (X_1, X_2, ..., X_n) [2]. It is composed of an undirected graph G and a set of potential functions φ_k. The graph has a node for each variable, and the model has a potential function for each clique in the graph. The joint distribution represented by a Markov network is:

    P(X = x) = (1/Z) ∏_k φ_k(x_{k})    (1)

where x_{k} is the state of the kth clique (i.e., the state of the variables that appear in that clique), and Z is a normalization constant called the partition function. Markov networks are often conveniently represented as log-linear models, with each clique potential replaced by an exponentiated weighted sum of features of the state:

    P(X = x) = (1/Z) exp( ∑_j w_j f_j(x) )    (2)

A feature f_j(x) may be any real-valued function of the state. For discrete data, a feature typically is a conjunction of tests of the form X_i = x_i, where X_i is a variable and x_i is a value of that variable. We say that a feature matches an example if it is true of that example.
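To make the log-linear representation of Equation (2) concrete, here is a minimal sketch of scoring and normalizing a toy model. All features, weights, and variable indices are invented for illustration; the partition function Z is computed by brute-force enumeration, which is only feasible for very small models.

```python
import itertools
import math

# Hypothetical toy model: 3 binary variables and 2 conjunctive features.
# Each feature is a dict {variable_index: required_value} plus a weight w_j.
features = [
    ({0: 1, 1: 1}, 1.5),   # f_1: X_0 = 1 AND X_1 = 1
    ({1: 0, 2: 1}, -0.7),  # f_2: X_1 = 0 AND X_2 = 1
]

def matches(feature, x):
    """A feature matches an example if all of its tests hold in x."""
    return all(x[i] == v for i, v in feature.items())

def unnormalized(x):
    """exp of the weighted sum of matching features: the numerator of Eq. (2)."""
    return math.exp(sum(w for f, w in features if matches(f, x)))

# The partition function Z sums over all 2^3 joint states.
Z = sum(unnormalized(x) for x in itertools.product([0, 1], repeat=3))

def prob(x):
    """P(X = x) as in Equation (2)."""
    return unnormalized(x) / Z
```

States that match a strongly positive feature, such as (1, 1, 0) here, receive higher probability than states matching no feature.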
B. Inference

The main inference task in graphical models is to compute the conditional probability of some variables (the query) given the values of some others (the evidence), by summing out the remaining variables. This problem is #P-complete. Thus, approximate inference techniques are required. One widely used method is Markov chain Monte Carlo (MCMC) [3], and in particular Gibbs sampling, which proceeds by sampling each variable in turn given its Markov blanket, the variables it appears with in some potential. These samples can be used to answer probabilistic queries by counting the number of samples that satisfy each query and dividing by the total number of samples. Under modest assumptions, the distribution represented by these samples will eventually converge to the true distribution. However, convergence may require a very large number of samples, and detecting convergence is difficult.
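The Gibbs sampling procedure described above can be sketched as follows: each variable is resampled in turn from its conditional distribution given the current state, and a marginal query is answered by counting samples. The three-variable log-linear model here is hypothetical, chosen only so the sketch is self-contained.

```python
import math
import random

# Hypothetical toy model: conjunctive features as ({index: value}, weight).
features = [({0: 1, 1: 1}, 1.5), ({1: 0, 2: 1}, -0.7)]

def unnormalized(x):
    return math.exp(sum(w for f, w in features
                        if all(x[i] == v for i, v in f.items())))

def gibbs_samples(n_samples, burn_in=500, seed=0):
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(3)]
    samples = []
    for t in range(burn_in + n_samples):
        for i in range(3):  # resample each variable in turn
            x0, x1 = list(x), list(x)
            x0[i], x1[i] = 0, 1
            # Conditional of X_i given the rest; Z cancels in the ratio.
            p1 = unnormalized(x1) / (unnormalized(x0) + unnormalized(x1))
            x[i] = 1 if rng.random() < p1 else 0
        if t >= burn_in:
            samples.append(tuple(x))
    return samples

# Estimate P(X_0 = 1) by counting the samples that satisfy the query.
samples = gibbs_samples(5000)
p_x0 = sum(s[0] for s in samples) / len(samples)
```

In a real implementation only the features mentioning X_i need to be rescored, i.e., the conditional depends only on X_i's Markov blanket.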
C. Weight Learning

The goal of weight learning is to select feature weights that maximize a given objective function. One of the most popular objective functions is the log likelihood of the training data. In a Markov network, log likelihood is a convex function of the weights, and thus weight learning can be posed as a convex optimization problem. However, this optimization typically requires evaluating the log likelihood and its gradient in each iteration. This is typically intractable to compute exactly due to the partition function. Furthermore, an approximation may work poorly: Kulesza and Pereira [4] have shown that approximate inference can mislead weight learning algorithms.

A more efficient alternative, widely used in areas such as spatial statistics, social network modeling and language processing, is to optimize the pseudo-likelihood [5] instead. Pseudo-likelihood is the product of the conditional probabilities of each variable given its Markov blanket:

    log P•_w(X = x) = ∑_{i=1}^{V} ∑_{j=1}^{N} log P_w(X_{i,j} = x_{i,j} | MB(X_{i,j}))    (3)

where V is the number of variables, N is the number of training examples, and MB(X_{i,j}) denotes the state of X_i's Markov blanket in the jth example. Pseudo-likelihood and its gradient can be computed efficiently and optimized using any standard convex optimization algorithm, since the pseudo-likelihood of a Markov network is also convex.
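As a sketch of the pseudo-likelihood objective, the code below evaluates it for a toy log-linear model (the features and weights are hypothetical). Each conditional is a two-state ratio in which the partition function cancels, which is why pseudo-likelihood is cheap to compute.

```python
import math

# Hypothetical toy model: conjunctive features as ({index: value}, weight).
features = [({0: 1, 1: 1}, 1.5), ({1: 0, 2: 1}, -0.7)]

def unnormalized(x):
    return math.exp(sum(w for f, w in features
                        if all(x[i] == v for i, v in f.items())))

def log_pseudo_likelihood(data):
    """Sum over examples and variables of log P_w(X_i = x_i | rest)."""
    total = 0.0
    for x in data:
        for i in range(len(x)):
            flipped = list(x)
            flipped[i] = 1 - x[i]  # the other state of binary variable X_i
            num = unnormalized(x)
            # Z cancels: P(X_i = x_i | rest) = num / (num + flipped state).
            total += math.log(num / (num + unnormalized(tuple(flipped))))
    return total
```

This function (and its gradient) can then be handed to any standard convex optimizer to learn the weights.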
D. Structure Learning

Della Pietra et al.'s algorithm [2] is the standard approach to learning the structure of a Markov network. The algorithm starts with a set of atomic features (i.e., just the variables in the domain). It creates candidate features by conjoining each feature to each other feature, including the original atomic features. It calculates the weight for each candidate feature by assuming that all other feature weights remain unchanged, which is done for efficiency reasons. It uses Gibbs sampling for inference when setting the weight. Then, it evaluates each candidate feature f by estimating how much adding f would increase the log-likelihood. It adds the feature that results in the largest gain to the feature set. This procedure terminates when no candidate feature improves the model's score.

Recently, Davis and Domingos [6] proposed an alternative bottom-up approach, called BLM, for learning the structure of a Markov network. BLM starts by treating each complete example as a long feature in the Markov network. The algorithm repeatedly iterates through the feature set. It considers generalizing each feature to match its k nearest previously unmatched examples by dropping variables. If incorporating the newly generalized feature improves the model's score, it is retained in the model. The process terminates when no generalization improves the score.

A recent L1-regularization-based approach to structure learning is the method of Ravikumar et al. [1]. It learns the structure by trying to discover the Markov blanket of each variable (i.e., its neighbors in the network). It considers each variable X_i in turn and builds an L1 logistic regression model to predict the value of X_i given the remaining variables. L1 regularization encourages sparsity, so that most of the variables end up with a weight of zero. The Markov blanket of X_i is all variables that have non-zero weight in the logistic regression model. In the limit of infinite data, consistency is guaranteed (i.e., X_i is in X_j's Markov blanket if and only if X_j is in X_i's Markov blanket). In practice, this is often not the case, and there are two methods to decide which edges to include in the network. One includes an edge if either X_i is in X_j's Markov blanket or X_j is in X_i's Markov blanket. The other includes an edge only if both X_i is in X_j's Markov blanket and X_j is in X_i's Markov blanket. Finally, all features are added to the model and their weights are learned globally using any standard weight learning algorithm.
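The two edge-inclusion rules described above can be sketched directly. The `blankets` mapping stands in for the sets of non-zero-weight variables that the per-variable L1 logistic regressions would return; its values here are invented for illustration.

```python
# Hypothetical output of per-variable L1 logistic regression:
# blankets[i] = variables with non-zero weight when predicting X_i.
blankets = {0: {1, 2}, 1: {0}, 2: {1}}

def edges(blankets, rule="or"):
    """Combine local Markov blankets into a global edge set.

    rule="or":  include (i, j) if either variable is in the other's blanket.
    rule="and": include (i, j) only if each is in the other's blanket.
    """
    result = set()
    variables = sorted(blankets)
    for i in variables:
        for j in variables:
            if i < j:
                in_i = j in blankets[i]
                in_j = i in blankets[j]
                keep = (in_i or in_j) if rule == "or" else (in_i and in_j)
                if keep:
                    result.add((i, j))
    return result
```

The "and" rule always yields a subgraph of the "or" rule, so it produces the sparser network.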