Fast Bayesian Network Structure Search Using Gaussian Processes
Blake Anderson and Terran Lane
University of New Mexico, Albuquerque NM 87131, USA
Abstract. In this paper we introduce two novel methods for performing Bayesian network structure search that make use of Gaussian Process regression. Using a relatively small number of samples from the posterior distribution of Bayesian networks, we are able to find an accurate function approximator based on Gaussian Processes. This allows us to remove our dependency on the data during the search and leads to massive speed improvements without sacrificing performance. We use our function approximator in the context of Hill Climbing, a local-score based search algorithm, and in the context of a global optimization technique based on response surfaces. We applied our methods to both synthetic and real data. Results show that we converge to networks of equal score to those found by traditional Hill Climbing, but at a fraction of the total time.
1 Introduction
Bayesian networks compactly represent a joint probability distribution by exploiting a set of independence assumptions among the random variables. A Bayesian network is a probabilistic graphical model where the random variables are represented as nodes and the conditional dependencies among the random variables are represented as edges. They are frequently used in bioinformatics, image processing, medical applications, and many other problem domains [1].

An important problem associated with Bayesian networks is model selection: given a dataset, find a structure that maximizes the posterior probability with respect to the data. Unfortunately, the space of all structures is super-exponential in the number of random variables, and finding the optimal structure is known to be NP-complete [2]. Many methods have been devised to overcome the obstacle of searching through such a large space. These methods usually employ greedy heuristics to cut down on the size of the space [3] and rely on the decomposability of local scoring metrics such as the BDeu score [4]. With a decomposable metric, a candidate modification can be evaluated separately from the rest of the network.

In traditional score-based search algorithms, the evaluation of the modification still requires looking at the entire dataset. Even with clever caching techniques, as the size of the dataset increases, the overhead incurred by scanning the data becomes very expensive.

We present two novel solutions to this problem. The first dramatically speeds up current score-based search algorithms without sacrificing performance by using a Gaussian Process to approximate the search landscape. With Gaussian Process regression, we aim to approximate a function, f : G → R, capable of computing reasonable BDeu scores given the structure of the Bayesian network, G. We begin constructing our function by sampling a constant number of structures from the super-exponential structure space.
We calculate the exact BDeu score for each structure in this sample, giving us a set of structure/value pairs. We then use a Gaussian Process regressor with a graph kernel to learn a model of the function from structures to scores. Given this model, we can discard the training data during the iterations of the search algorithm. Our Gaussian Process based function approximator can easily be used in a wide variety of search algorithms to speed up the process of determining which step in the space to take. We illustrate this point by implementing the Hill Climbing algorithm [5] using our method and showing the corresponding speed-ups.
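As a concrete illustration of how a surrogate scorer plugs into local search, the following sketch runs plain Hill Climbing over adjacency matrices. The `score` callable stands in for the trained Gaussian Process regressor; all function names here are hypothetical and not from the paper's implementation.

```python
def is_acyclic(adj):
    # Kahn's algorithm: repeatedly remove nodes with no incoming edges.
    n = len(adj)
    indeg = [sum(adj[i][j] for i in range(n)) for j in range(n)]
    queue = [j for j in range(n) if indeg[j] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for v in range(n):
            if adj[u][v]:
                indeg[v] -= 1
                if indeg[v] == 0:
                    queue.append(v)
    return seen == n

def neighbors(adj):
    # One-step changes: add, delete, or reverse a single edge.
    n = len(adj)
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            if adj[i][j]:
                cand = [row[:] for row in adj]
                cand[i][j] = 0                 # delete i -> j
                yield cand
                rev = [row[:] for row in adj]
                rev[i][j], rev[j][i] = 0, 1    # reverse i -> j
                yield rev
            elif not adj[j][i]:
                cand = [row[:] for row in adj]
                cand[i][j] = 1                 # add i -> j
                yield cand

def hill_climb(n, score, max_iters=50):
    """Start from the empty graph; take the best one-step change until
    no change improves the (surrogate) score.  `score` would be the GP
    function approximator, so no pass over the data is needed here."""
    adj = [[0] * n for _ in range(n)]
    best = score(adj)
    for _ in range(max_iters):
        cand = max((g for g in neighbors(adj) if is_acyclic(g)),
                   key=score, default=None)
        if cand is None or score(cand) <= best:
            break
        adj, best = cand, score(cand)
    return adj, best
```

Because `score` never touches the dataset, each iteration's cost depends only on the number of candidate moves and the cost of a surrogate evaluation.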
Our second approach is related to the response surface framework from optimization theory [6], in which one draws a sample of points from the objective function and then fits a regression model to that sample. One then optimizes the response surface, which is typically more tractable to work with than the objective function. Using the mean-squared error of the response surface, we can locate the point on the surface that has the highest probability of having a better score than the current best Bayesian network score. We then find the true score of this point, update the response surface, and iterate. Response surfaces allow us to quickly find a global optimum using a small number of samples.
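The point-selection step above can be sketched with the standard normal CDF: given a Gaussian predictive distribution at a candidate point, the probability that it beats the incumbent score is a closed-form expression. This is a generic sketch (treating the search as score maximization), not the paper's exact acquisition rule; `prob_improvement` is a hypothetical helper.

```python
from math import erf, sqrt

def prob_improvement(mu, sigma, best):
    """Probability that a point with Gaussian predictive mean `mu` and
    standard deviation `sigma` scores above the incumbent `best`:
    P(f > best) = Phi((mu - best) / sigma), Phi the standard normal CDF.
    A zero-variance prediction is treated as certain (probability 1 or 0)."""
    if sigma == 0.0:
        return 1.0 if mu > best else 0.0
    z = (mu - best) / sigma
    return 0.5 * (1.0 + erf(z / sqrt(2.0)))
```

At each iteration one would evaluate `prob_improvement` at every candidate structure, query the true BDeu score only at the argmax, and refit the response surface with the new point.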
The use of a function approximator to model a Bayesian network scoring metric has been recently explored [7]. There, the authors construct a metagraph of Bayesian networks, and then use the eigenvectors of the metagraph to interpolate the scores of unknown networks, given the scores of a small number of sampled points. However, Yackley et al. show only that they can accurately approximate the true scoring function; they do not actually apply their function approximator to search. We go beyond that work by demonstrating the use of such approximators to dramatically speed up structure search.
There has been little work done to exploit Gaussian Processes to improve the different aspects of Bayesian networks. Gaussian Process networks are the only true prior instance of combining these two fields that we are aware of [8]. Most Bayesian networks based on continuous variables use Gaussian distributions to model continuous random variables; such networks, however, can only learn linear dependencies in the data. Gaussian Process networks were developed to cope with this problem: a Gaussian Process prior over the random variables has the advantage of being able to learn functional dependencies between the data, while still allowing the marginal probability of the data to be computed in closed form.

2 Background
2.1 Bayesian Network Structure Search
Let U = {X_1, ..., X_n} be a set of random variables. A Bayesian network is a probabilistic graphical model that compactly encodes the joint probability distribution among these random variables. Let B = ⟨G, Θ⟩ be a Bayesian network described by a graph, G, and conditional probability distributions, Θ. G is a directed acyclic graph where the nodes represent the random variables of U and the edges represent statistical dependencies among the nodes. We can compactly represent the joint probability distribution because each node, X_i, is statistically independent of all other nodes in the graph given its parent set, Pa(X_i), with respect to G. Θ_i is the conditional probability distribution for X_i, which encodes the node's dependence on its parents. Given the independence assumption, we can write the joint probability distribution as
\[
P_B(X_1, \dots, X_n \mid \Theta) = \prod_{i=1}^{n} P_B(X_i \mid \mathrm{Pa}(X_i), \Theta_i) \tag{1}
\]

We are interested in learning a B = ⟨G, Θ⟩ that maximizes some score of the structure, given a dataset D = {x_1, ..., x_n}, where each x is a vector of the values for the variables in U. Typical scoring functions include BDeu [4], BIC [9], and MDL [10]. For simplicity, in this paper we use BDeu, though the basic framework here is independent of the scoring function.

Due to the independence assumptions of Bayesian networks, certain scoring functions are decomposable, meaning that the score of one family can be computed independently from the scores of the rest of the network. This property is used extensively by search algorithms, as they are able to compute the score of each change they consider efficiently. Structure learning is generally accomplished using a decomposable scoring function, such as BDeu, and local search strategies.

Perhaps the simplest local search strategy is Hill Climbing. This method starts with an empty graph, and then tries to find the one-step change (adding, reversing, or deleting an edge) that increases the BDeu score of the Bayesian network the most. There are several related approaches that improve upon this simple idea. These include Repeated Hill Climbing, which adds random restarts, and Look-Ahead Hill Climbing, which chooses a set of the best changes and tries to avoid local maxima by looking ahead a certain number of steps to determine which is the best choice.
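A minimal sketch of the factorization in Eq. (1), for a hypothetical two-node network X1 → X2 with binary variables; the conditional probability tables below are made-up illustrative numbers, not from the paper.

```python
# CPTs for the toy network X1 -> X2 (illustrative values only).
p_x1 = {0: 0.6, 1: 0.4}                    # P(X1)
p_x2_given_x1 = {0: {0: 0.9, 1: 0.1},      # P(X2 | X1 = 0)
                 1: {0: 0.2, 1: 0.8}}      # P(X2 | X1 = 1)

def joint(x1, x2):
    # Eq. (1): the joint is the product of each node's conditional
    # given its parents; here P(X1, X2) = P(X1) * P(X2 | X1).
    return p_x1[x1] * p_x2_given_x1[x1][x2]
```

Summing `joint` over all four assignments returns 1, confirming the factorization defines a valid distribution.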
2.2 Gaussian Process Regression

A Gaussian Process represents a collection of random variables, in which any finite subset has a joint Gaussian distribution [11]. The entire Gaussian Process can be thought of as an infinite-dimensional Gaussian
\[
\mathcal{N}(\mu, K) \tag{2}
\]

where μ is the mean function and K is the covariance function. The mean function is typically taken to be zero without loss of generality. The squared exponential is a typical covariance function for real-valued vectors

\[
K(x, x') = \sigma_f^2 \exp\!\left(-\frac{1}{2\ell^2}\,|x - x'|^2\right) + \sigma_n^2\,\delta_{xx'} \tag{3}
\]
where σ_f^2, σ_n^2, and ℓ are hyperparameters and δ is the Kronecker delta. The hyperparameters are usually set by maximizing the marginal likelihood through optimization techniques such as gradient descent.

Regression begins by building up the covariance matrix with all of the training data, D = {X, f}. X consists of all the instances, x, in the dataset, and f is the vector of labels for each x. The covariance matrix is built such that each entry of the matrix is K(x, x'). Using this information, we can construct the joint probability distribution between our training data and the test data

\[
\begin{bmatrix} f \\ f_* \end{bmatrix} \sim \mathcal{N}\!\left( \begin{bmatrix} \mu \\ \mu_* \end{bmatrix}, \begin{bmatrix} K & K_* \\ K_*^{T} & K_{**} \end{bmatrix} \right) \tag{4}
\]

where K is the covariance matrix between datapoints in the training data, K_{**} is the covariance matrix between datapoints in the testing data, K_* is the covariance matrix between the training and testing data, and f_* is the vector of predicted values. If we condition the joint probability distribution on the observed training data and assume noise-free observations, we have