Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior

Bayesian Structure Learning for Markov Random Fields with a Spike and Slab Prior Yutian Chen Max Welling Department of Computer Science Department of Computer Science University of California, Irvine University of California, Irvine Irvine, CA 92697 Irvine, CA 92697 Abstract ble to measure a multitude of data-attributes. There is also an increasing interest in sparse model structures because In recent years a number of methods have they help against overfitting and are computationally more been developed for automatically learning the tractable than dense model structures. (sparse) connectivity structure of Markov Ran- In this paper we focus on a particular type of MRF, called dom Fields. These methods are mostly based on a log-linear model, where structure learning or feature se- L1-regularized optimization which has a num- lection is integrated with parameter estimation. Although ber of disadvantages such as the inability to structure learning has been extensively studied for directed assess model uncertainty and expensive cross- graphical models, it is typically more difficult for undi- validation to find the optimal regularization pa- rected models due to the intractable normalization term of rameter. Moreover, the model’s predictive per- the probability distribution, known as the partition func- formance may degrade dramatically with a sub- tion. Traditional algorithms apply only to restricted types optimal value of the regularization parameter of structures with low tree-width (Andrew and Gao, 2007; (which is sometimes desirable to induce sparse- Tsuruoka et al., 2009; Hu et al., 2009) or special models ness). We propose a fully Bayesian approach such as Gaussian graphical models (Jones et al., 2005) so based on a “spike and slab” prior (similar to that accurate inference can be conducted efficiently. L0 regularization) that does not suffer from these shortcomings. We develop an approxi- For an arbitrary structure, various methods have been pro- mate MCMC method combining Langevin dy- posed in the literature, generally categorized into two ap- namics and reversible jump MCMC to conduct proaches. One approach is based on separate tests on an inference in this model. Experiments show that edge or the neighbourhood of a node so that there is no need the proposed model learns a good combination to compute the joint distribution (Wainwright et al., 2007; of the structure and parameter values without Bresler et al., 2008; Ravikumar et al., 2010). The other ap- the need for separate hyper-parameter tuning. proach is based on maximum likelihood estimation (MLE) Moreover, the model’s predictive performance with a sparsity inducing criterion. These methods require approximate inference algorithms in order to estimate the is much more robust than L1-based methods with hyper-parameter settings that induce highly log-likelihood such as Gibbs sampling (Della Pietra et al., sparse model structures. 1997), loopy belief propagation (Lee et al., 2006; Zhu et al., 2010), or pseudo-likelihood (Hofling¨ and Tibshirani, 2009). A popular choice of such a criterion is L1 regular- 1 Introduction ization (Riezler and Vasserman, 2004; Dudik et al., 2004) which enjoys several good properties such as a convex ob- jective function and a consistency guarantee. However, L - Undirected probabilistic graphical models, also known as 1 regularized MLE is usually sensitive to the choice the reg- Markov Random Fields (MRFs), have been widely used ularization strength, and these optimization-based methods in a large variety of domains including computer vision cannot provide a credible interval for the learned structure. (Li, 2009), natural language processing (Sha and Pereira, Also, in order to learn a sparse structure, a strong penalty 2003), and social networks (Robins et al., 2007). The struc- has to be imposed on all the edges which usually results in ture of the model is defined through a set of features defined suboptimal parameter values. on subsets of random variables. Automated methods to se- lect relevant features are becoming increasingly important We will follow a third approach to MRF structure learn- in a time where the proliferation of sensors make it possi- ing in a fully Bayesian framework which has not been explored yet. The Bayesian approach considers the structure MRFs with log-linear parametrization: of a graphical model as random. Inference in a Bayesian ! 1 X model provides inherent regularization, and offers a fully P (xjθ) = exp θ f (x ) (1) Z(θ) α α α probabilistic characterization of the underlying structure. α It was shown in Park and Casella (2008) that Bayesian where each potential function is defined as the exponential models with a Gaussian or Laplace prior distribution (cor- of the product between a feature function f of a set of responding to L or L regularization) do not exhibit α 2 1 variables x and an associated parameter θ . Z is called sparse structure. Mohamed et al. (2011) proposes to use a α α the partition function. All the variables in the scope of a “spike and slab” prior for learning directed graphical mod- potential function form a clique in their graphical represen- els which corresponds to the ideal L regularization. This 0 tation. When a parameter θ has a value of zero, we could model exhibits better robustness against over-fitting than α equivalently remove feature f and all the edges between the related L approaches. Unlike the Laplace/Gaussian α 1 variables in x (if these variables are not also in the scope prior, the posterior distribution over parameters for a “spike α of other features) without changing the distribution of x. and slab” prior is no longer guaranteed to be unimodal. Therefore, by learning the parameters of this MRF model However, approximate inference methods have been suc- we can simultaneously learn the structure of a model if we cessfully applied in the context of directed models using allow some parameters to go to zero. MCMC (Mohamed et al., 2011) and expectation propagation (Hernandez-Lobato´ et al., 2010). The Bayesian learning approach to graphical models considers parameters as a random variable subject to a prior. Unfortunately, Bayesian inference for MRFs is much Given observed data, we can infer the posterior distribu- harder than for directed networks due to the partition tion of the parameters and their connectivity structure. Two function. This feature renders even MCMC sampling in- commonly used priors, the Laplace and the Gaussian distri- tractable which caused some people to dub these prob- bution, correspond to the L and L penalties respectively lems “double intractability” (Murray et al., 2006). Nev- 1 2 in the associated optimization-based approach. Although ertheless, variational methods (Parise and Welling, 2006; a model learned by L -penalized MLE is able to obtain a Qi et al., 2005) and MCMC methods (Murray and Ghahra- 1 sparse structure, the full Bayesian treatment usually results mani, 2004) have been successfully explored for approxi- in a fully connected model with many weak edges as ob- mate inference when the model structure is fixed. served in Park and Casella (2008), without special approx- We propose a Bayesian structure learning method with a imate assumptions like the ones in Lin and Lee (2006). We spike and slab prior for MRFs and devise an approximate propose to use the “spike and slab” prior to learn a sparse MCMC method to draw samples of both the model struc- structure for MRFs in a fully Bayesian approach. The spike ture and the model parameters by combining a modified and slab prior (Mitchell and Beauchamp, 1988; Ishwaran Langevin sampler with a reversible jump MCMC method. and Rao, 2005) is a mixture distribution which consists of Experiments show that the posterior distribution estimated a point mass at zero (spike) and a widely spread distribution by our inference method matches the actual distribution (slab): very well. Moreover, our method offers better robustness 2 P (θα) = (1 − p0)δ(θα) + p0N (θα; 0; σ ) (2) to both under-fitting and over-fitting than L1-regularized 0 methods. A related but different application of the spike where p0 2 [0; 1], δ is the Dirac delta function, and σ0 is and slab distribution in MRFs is shown in Courville et al. usually large enough to be uninformative. The spike com- (2011) for modelling hidden random variables. ponent controls the sparsity of the structure in the poste- This paper is organized as follows: we first introduce a hi- rior distribution while the slab component usually applies erarchical Bayesian model for MRF structure learning in a mild shrinkage effect on the parameters of the existing section 2 and then describe an approximate MCMC method edges even in a highly sparse model. This type of selective in section 3, 4, and 5 to draw samples for the model param- shrinkage is different from the global shrinkage imposed eters, structure, and other hyper-parameters respectively. by L1=L2 regularization, and enjoys benefits in parameter Experiments are conducted in section 6 on two simulated estimation as demonstrated in the experiment section. data sets and a real-world dataset, followed by a discussion The Bayesian MRF with the spike and slab prior is formu- section. lated as follows: ! 1 X P (xjθ) = exp θ f (x ) 2 Learning the Structure of MRFs as Z(θ) α α α α Bayesian Inference θα = YαAα Yα ∼ Bern(p0) p0 ∼ Beta(a; b) The probability distribution of a MRF is defined by a set 2 −2 of potential functions. Consider a widely used family of Aα ∼ N (0; σ0) σ0 ∼ Γ(c; d) (3) ing samples of Aα with fixed Y in this section and will use θα and Aα interchangeably to refer to a nonzero parameter. Even for an MRF model with a fixed structure, MCMC is still intractable. Approximate MCMC methods have been discussed in Murray and Ghahramani (2004) among which Langevin Monte Carlo (LMC) with “brief sampling” to compute the required expectations in the gradients, shows good performance.

Load more