
Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures

Tomi Silander (1), Janne Leppä-aho (2,4), Elias Jääsaari (2,3), Teemu Roos (2,4)

(1) NAVER LABS Europe; (2) Helsinki Institute for Information Technology HIIT; (3) Department of CS, Aalto University; (4) Department of CS, University of Helsinki

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Abstract

We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy.

1 INTRODUCTION

Bayesian networks [Pearl, 1988] are popular models for presenting multivariate statistical dependencies that may have been induced by underlying causal mechanisms. Techniques for learning the structure of Bayesian networks from observational data have therefore been used for many tasks such as discovering cell signaling pathways from protein activity data [Sachs et al., 2002], revealing business process structures from transaction logs [Savickas and Vasilecas, 2014], and modeling brain-region connectivity using fMRI data [Ide et al., 2014].

Learning the structure of statistical dependencies can be seen as a model selection task where each model is a different hypothesis about the conditional dependencies between sets of variables. Traditional model selection criteria such as the Akaike information criterion (AIC) [Akaike, 1973] and the Bayesian information criterion (BIC) [Schwarz, 1978] have also been used for the task, but recent comparisons have not been favorable for AIC, and BIC appears to require large sample sizes in order to identify appropriate structures [Silander et al., 2008, Liu et al., 2012]. Traditionally, the most popular criterion has been the Bayesian marginal likelihood [Heckerman, 1995] and its BDeu variant (see Section 2), but studies [Silander et al., 2007, Steck, 2008] show this criterion to be sensitive to hyperparameters and to yield undesirably complex models for small sample sizes.

The information-theoretic normalized maximum likelihood (NML) criterion [Shtarkov, 1987, Rissanen, 1996] would otherwise be a potential candidate for a good criterion, but its exact calculation is likely to be prohibitively expensive. In 2008, Silander et al. introduced a hyperparameter-free, NML-inspired criterion called the factorized NML (fNML) [Silander et al., 2008] that was shown to yield good predictive models without such sensitivity problems. However, from the structure learning point of view, fNML still sometimes appears to yield overly complex models. In this paper we introduce another NML-related criterion, the quotient NML (qNML), that yields simpler models without sacrificing predictive accuracy. Furthermore, unlike fNML, qNML is score equivalent, i.e., it gives equal scores to structures that encode the same independence and dependence statements. Like other common model selection criteria, qNML is also consistent.

We next briefly introduce Bayesian networks and review the BDeu and fNML criteria, and then introduce the qNML criterion. We also summarize the results for 20 data sets to back up our claim that qNML yields parsimonious models with good predictive capabilities.

The experiments with artificial data generated from real-world Bayesian networks demonstrate the capability of our score to quickly learn a structure close to the generating one.

2 BAYESIAN NETWORKS

Bayesian networks are a general way to describe the dependencies between the components of an n-dimensional random data vector. In this paper we only address the case in which the component X_i of the data vector X = (X_1, ..., X_n) may take any of the discrete values in a set {1, ..., r_i}. Despite denoting the values with small integers, the model will treat each X_i as a categorical variable.

[Figure 1: Number of parameters in a breast cancer (BreastC) model as a function of sample size for different model selection criteria; curves for BDeu, BIC, fNML and qNML, y-axis: average number of parameters, x-axis: sample size up to 250.]

2.1 Likelihood

A Bayesian network B = (G, θ) defines a probability distribution for X. The component G defines the structure of the model as a directed acyclic graph (DAG) that has exactly one node for each component of X. The structure G = (G_1, ..., G_n) defines for each variable/node X_i its (possibly empty) parent set G_i, i.e., the nodes from which there is a directed edge to the variable X_i.

Given a realization x of X, we denote the sub-vector of x that consists of the values of the parents of X_i in x by G_i(x). It is customary to enumerate all the possible sub-vectors G_i(x) from 1 to q_i = \prod_{h \in G_i} r_h. In case G_i is empty, we define q_i = 1 and P(G_i(x) = 1) = 1 for all vectors x.

For each variable X_i there is a q_i × r_i table θ_i of parameters whose kth column on the jth row θ_ij defines the conditional probability P(X_i = k | G_i(X) = j; θ) = θ_ijk. With structure G and parameters θ, we can now express the likelihood of the model as

    P(x \mid G, \theta) = \prod_{i=1}^{n} P(x_i \mid G_i(x); \theta_i) = \prod_{i=1}^{n} \theta_{i G_i(x) x_i}.    (1)
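To make the notation concrete, here is a minimal sketch (our own toy example; the network, parameter values and helper names are invented for illustration and are not from the paper) that evaluates equation (1) for a three-variable network in which X3 has parents X1 and X2:

```python
# Toy structure: parents of each node; X3 (index 2) has parents X1 (index 0) and X2 (index 1).
parents = {0: [], 1: [], 2: [0, 1]}
r = [2, 2, 2]                                   # number of values r_i of each variable

# Conditional probability tables: theta[i][j][k] = P(X_i = k | parent configuration j).
theta = {
    0: [[0.7, 0.3]],                            # q_0 = 1 parent configuration (no parents)
    1: [[0.4, 0.6]],                            # q_1 = 1
    2: [[0.9, 0.1], [0.5, 0.5],                 # q_2 = r_0 * r_1 = 4 parent configurations
        [0.2, 0.8], [0.1, 0.9]],
}

def parent_config(i, x):
    """Index j of the parent configuration G_i(x), enumerated in mixed-radix order."""
    j = 0
    for h in parents[i]:
        j = j * r[h] + x[h]
    return j

def likelihood(x):
    """Equation (1): P(x | G, theta) as a product of local conditional probabilities."""
    p = 1.0
    for i in range(len(x)):
        p *= theta[i][parent_config(i, x)][x[i]]
    return p

print(likelihood((0, 1, 1)))   # 0.7 * 0.6 * P(X3=1 | X1=0, X2=1) = 0.7 * 0.6 * 0.5
```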

2.2 Bayesian Structure Learning

Score-based Bayesian learning of Bayesian network structures evaluates the goodness of different structures G using their posterior probability P(G | D, α), where α denotes the hyperparameters for the model parameters θ, and D is a collection of N n-dimensional i.i.d. data vectors collected into an N × n design matrix. We use the notation D_i to denote the ith column of the data matrix and the notation D_V to denote the columns that correspond to the variable subset V. We also write D_{i,G_i} for D_{{i} ∪ G_i} and denote the entries of the column i on the rows on which the parents G_i contain the value configuration number j by D_{i,G_i=j}, j ∈ {1, ..., q_i}.

It is common to assume the uniform prior for structures, in which case the objective function for structure learning is reduced to the marginal likelihood P(D | G, α). If the model parameters θ_ij are further assumed to be independently Dirichlet distributed, depending only on i and G_i, and the data D is assumed to have no missing values, the marginal likelihood can be decomposed as

    P(D \mid G, \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(D_{i,G_i=j}; \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \int P(D_{i,G_i=j} \mid \theta_{ij}) P(\theta_{ij}; \alpha) \, d\theta_{ij}.    (2)

In coding terms this means that each data column D_i is first partitioned based on the values in columns G_i, and each part is then coded using a Bayesian mixture that can be expressed in closed form [Buntine, 1991, Heckerman et al., 1995].
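As an illustration of the closed form referred to above, the sketch below (our own toy example, not code from the paper) evaluates one factor P(D_{i,G_i=j}; α) of equation (2) with BDeu-style hyperparameters α/(q_i r_i), using the standard Dirichlet-multinomial marginal:

```python
from math import lgamma
from collections import Counter

def log_marginal_part(column_part, r_i, q_i, alpha=1.0):
    """log P(D_{i,G_i=j}; alpha): Dirichlet-multinomial marginal of one part of column i,
    with BDeu-style pseudo-counts alpha / (q_i * r_i) for each of the r_i values."""
    a_k = alpha / (q_i * r_i)            # per-value pseudo-count
    a_j = alpha / q_i                    # sum of the pseudo-counts over the r_i values
    counts = Counter(column_part)        # N_ijk for the observed values k
    n_j = len(column_part)               # N_ij, the size of this part
    res = lgamma(a_j) - lgamma(a_j + n_j)
    for k in range(r_i):
        res += lgamma(a_k + counts.get(k, 0)) - lgamma(a_k)
    return res

# Values of X_i (with r_i = 3) on the rows where the parents take configuration j.
part = [0, 0, 1, 0, 2, 0]
print(log_marginal_part(part, r_i=3, q_i=4, alpha=1.0))
```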
2.3 Problems, Solutions and Problems

Finding satisfactory Dirichlet hyperparameters for the Bayesian mixture above has, however, turned out to be problematic. Early on, one of the desiderata for a good model selection criterion was that it is score equivalent, i.e., that it would yield equal scores for essentially equivalent models [Verma and Pearl, 1991]. For example, the score for the structure X_1 → X_2 should be the same as the score for the model X_2 → X_1, since they both correspond to the hypothesis that variables X_1 and X_2 are statistically dependent on each other. It can be shown [Heckerman et al., 1995] that to achieve this, not all the hyperparameters α are possible, and for practical reasons Buntine [Buntine, 1991] suggested a so-called BDeu score with just one hyperparameter α ∈ R_{++} so that θ_{ij·} ∼ Dir(α/(q_i r_i), ..., α/(q_i r_i)).

However, it soon turned out that the BDeu score was very sensitive to the selection of this hyperparameter [Silander et al., 2007] and that for small sample sizes this method detects spurious correlations [Steck, 2008], leading to models with suspiciously many parameters.

Recently, Suzuki [Suzuki, 2017] discussed the theoretical properties of the BDeu score and showed that in certain settings BDeu has the tendency to add more and more parent variables for a child node even though the empirical conditional entropy of the child given the parents has already reached zero. In more detail, assume that in our data D the values of X_i are completely determined by the variables in a set Z, so that the empirical entropy H_N(X_i | Z) is zero. Now, if we can further find one or more variables, denoted by Y, whose values are determined completely by the variables in Z, then BDeu will prefer the set Z ∪ Y over Z alone as the parents of X_i. Suzuki argues that this kind of behavior violates regularity in model selection, as the more complex model is preferred over a simpler one even though it does not fit the data any better.

The phenomenon seems to stem from the way the hyperparameters for the Dirichlet distributions are chosen in BDeu, since using Jeffreys' prior, θ_{ijk} ∼ Dir(1/2, ..., 1/2), does not suffer from this anomaly. However, using Jeffreys' prior causes the marginal likelihood score not to be score equivalent. In Section 3.4, we will give the formal definition of regularity and state that qNML is regular. In addition, we provide a proof of regularity for the fNML criterion, which has not appeared in the literature before. The detailed proofs can be found in Appendix B in the Supplementary Material.

A natural solution to avoid the parameter sensitivity of BDeu would be to use a normalized maximum likelihood (NML) criterion [Shtarkov, 1987, Rissanen, 1996], i.e., to find the structure G that maximizes

    P_{NML}(D; G) = \frac{P(D \mid \hat{\theta}(D; G))}{\sum_{D'} P(D' \mid \hat{\theta}(D'; G))},    (3)

where \hat{\theta} denotes the (easy to find) maximum likelihood parameters and the sum in the denominator goes over all the possible N × n data matrices. This information-theoretic NML criterion can be justified from the minimum description length point of view [Rissanen, 1978, Grünwald, 2007]. It has been shown to be robust with respect to different data generating mechanisms where a good choice of prior is challenging, see [Eggeling et al., 2014, Määttä et al., 2016]. While it is easy to see that the NML criterion satisfies the requirement of giving equal scores to equal structures, the normalizing constant renders the computation infeasible.

Consequently, Silander et al. [Silander et al., 2008] suggested solving the BDeu parameter sensitivity problem by using the NML code for the column partitions, i.e., changing the Bayesian mixture in equation (2) to

    P^{1}_{NML}(D_{i,G_i=j}; G) = \frac{P(D_{i,G_i=j} \mid \hat{\theta}(D_{i,G_i=j}; G))}{\sum_{D'} P(D' \mid \hat{\theta}(D'; G))},    (4)

where D' ∈ {1, ..., r_i}^{|D_{i,G_i=j}|}. The logarithm of the denominator is often called the regret, since it indicates the extra code length needed compared to the code length obtained using the (a priori unknown) maximum likelihood parameters. The regret for P^1_NML depends only on the length N of the categorical data vector with r different categorical values,

    \mathrm{reg}(N, r) = \log \sum_{D \in \{1, \ldots, r\}^{N}} P(D \mid \hat{\theta}(D)).    (5)

While the naive computation of the regret is still prohibitive, Silander et al. approximate it efficiently using a so-called Szpankowski approximation [Kontkanen et al., 2003]:

    \mathrm{reg}(N, r) \approx \frac{\sqrt{2}\, r\, \Gamma(\frac{r}{2})}{3 \sqrt{N}\, \Gamma(\frac{r-1}{2})} + \frac{r-1}{2} \log \frac{N}{2} - \log \Gamma\!\left(\frac{r}{2}\right) + \frac{1}{2} \log \pi - \frac{r^{2}\, \Gamma^{2}(\frac{r}{2})}{9 N\, \Gamma^{2}(\frac{r-1}{2})} + \frac{2r^{3} - 3r^{2} - 2r + 3}{36 N}.    (6)

However, equation (6) is derived only for the case where r is constant and N grows. While with fNML it is typical that N is large compared to r, an approximation for all ranges of N and r, derived by Szpankowski and Weinberger [Szpankowski and Weinberger, 2012], can also be used:

    \mathrm{reg}(N, r) \approx N \left( \log \alpha + (\alpha + 2) \log C_{\alpha} - \frac{1}{C_{\alpha}} \right) - \frac{1}{2} \log \left( C_{\alpha} + \frac{2}{\alpha} \right),    (7)

where α = r/N and C_α = \frac{1}{2} + \frac{1}{2}\sqrt{1 + \frac{4}{\alpha}}. These approximations are compared in Table 1 to the exact regret for various values of N and r. For a constant N, equation (6) provides a progressively worse approximation as r grows. Equation (7), on the other hand, is a good approximation of the regret regardless of the ratio of N and r. In our experiments, we will use this approximation for the implementation of the qNML criterion.
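As a rough illustration (our own code, not from the paper), the following sketch implements the two approximations (6) and (7) and checks them against a brute-force evaluation of equation (5), which is feasible only for very small N and r:

```python
from math import lgamma, log, sqrt, exp, pi

def regret_exact(N, r):
    """Equation (5) by brute force: sum the maximized likelihood over all count vectors."""
    def compositions(n, parts):
        if parts == 1:
            yield (n,)
            return
        for first in range(n + 1):
            for rest in compositions(n - first, parts - 1):
                yield (first,) + rest
    total = 0.0
    for counts in compositions(N, r):
        log_term = lgamma(N + 1) - sum(lgamma(c + 1) for c in counts)   # multinomial coefficient
        log_term += sum(c * log(c / N) for c in counts if c > 0)        # maximized likelihood
        total += exp(log_term)
    return log(total)

def regret_szpankowski(N, r):
    """Equation (6): the approximation used by Kontkanen et al. (2003), meant for r << N."""
    g = exp(lgamma(r / 2) - lgamma((r - 1) / 2))     # Gamma(r/2) / Gamma((r-1)/2)
    return (sqrt(2) * r * g / (3 * sqrt(N))
            + (r - 1) / 2 * log(N / 2) - lgamma(r / 2) + 0.5 * log(pi)
            - r ** 2 * g ** 2 / (9 * N)
            + (2 * r ** 3 - 3 * r ** 2 - 2 * r + 3) / (36 * N))

def regret_sw(N, r):
    """Equation (7): the Szpankowski-Weinberger approximation, usable for any ratio r/N."""
    a = r / N
    Ca = 0.5 + 0.5 * sqrt(1 + 4 / a)
    return N * (log(a) + (a + 2) * log(Ca) - 1 / Ca) - 0.5 * log(Ca + 2 / a)

print(regret_exact(20, 4), regret_szpankowski(20, 4), regret_sw(20, 4))
```

For realistic N and r only the approximations are practical; the brute force is included here merely as a check.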

Table 1: Regret values for various values of N and r.

    N       r        eq. (6)     eq. (7)      exact
    50      10         13.24       13.26      13.24
    50      100        62.00       60.01      60.00
    50      1000      491.63      153.28     153.28
    50      10000   25635.15      265.28     265.28
    500     10         22.67       22.69      22.67
    500     100       144.10      144.03     144.03
    500     1000      624.35      603.93     603.93
    500     10000    4927.24     1533.38    1533.38
    5000    10         32.74       32.76      32.74
    5000    100       247.97      247.97     247.97
    5000    1000     1452.51     1451.78    1451.78
    5000    10000    6247.83     6043.16    6043.16

fNML solves the parameter sensitivity problem and yields predictive models superior to BDeu. However, the criterion does not satisfy the property of giving the same score for models that correspond to the same dependence statements. The score equivalence is usually viewed desirable when DAGs are considered only as models for statistical dependencies, without any causal interpretation. Furthermore, the learned structures are often rather complex (see Figure 1), which also hampers their interpretation. The quest for a model selection criterion that would yield more parsimonious, easier to interpret, but still predictive Bayesian network structures is one of the main motivations for this work.

3 QUOTIENT NML SCORE

We will now introduce a quotient normalized maximum likelihood (qNML) criterion for learning Bayesian network structures. While equally efficient to compute as BDeu and fNML, it is free from hyperparameters, and it can be proven to give equal scores to equivalent models. Furthermore, it coincides with the actual NML score for exponentially many models. In our empirical tests it produces models featuring good predictive performance with significantly simpler structures than BDeu and fNML.

Like BDeu and fNML, qNML can be expressed as a product of n terms, one for each variable, but unlike the other two, it is not based on further partitioning the corresponding data column:

    s^{qNML}(D; G) := \sum_{i=1}^{n} s_i^{qNML}(D; G), \qquad s_i^{qNML}(D; G) := \log \frac{P^{1}_{NML}(D_{i,G_i}; G)}{P^{1}_{NML}(D_{G_i}; G)}.    (8)

The trick here is to model a subset of columns as though there were no conditional independencies among the corresponding variables S ⊂ X. In this case, we can collapse the \prod_{X_i \in S} r_i value configurations and consider them as values of a single variable with \prod_{X_i \in S} r_i different values, which can then be modeled with a one-dimensional P^1_NML code. The s^{qNML} score does not necessarily define a distribution for D, but it is easy to verify that it coincides with the NML score for all networks that are composed of fully connected components. The number of such networks is lower bounded by the number of nonempty partitions of a set of n elements, i.e., the nth Bell number.
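The following sketch (our own illustration; helper names and the toy data are invented) computes s^{qNML} directly from equation (8), treating each column set as a single collapsed categorical variable and using the approximation (7) for the regret (with reg(N, 1) set to its exact value, zero):

```python
from collections import Counter
from math import log, sqrt

def regret(N, m):
    """Approximation (7) of the multinomial regret reg(N, m); exactly zero for a one-value alphabet."""
    if m == 1 or N == 0:
        return 0.0
    a = m / N
    Ca = 0.5 + 0.5 * sqrt(1 + 4 / a)
    return N * (log(a) + (a + 2) * log(Ca) - 1 / Ca) - 0.5 * log(Ca + 2 / a)

def log_pnml1(data, cols, cards):
    """log P^1_NML of the columns `cols`, collapsed into one categorical variable."""
    N = len(data)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    log_ml = sum(n * log(n / N) for n in counts.values())   # maximized likelihood
    m = 1
    for c in cols:
        m *= cards[c]                                        # number of joint value configurations
    return log_ml - regret(N, m)

def qnml_score(data, parents, cards):
    """Equation (8): sum_i [ log P^1_NML(D_{i,G_i}) - log P^1_NML(D_{G_i}) ]."""
    return sum(log_pnml1(data, [i] + parents[i], cards) - log_pnml1(data, parents[i], cards)
               for i in range(len(cards)))

# Toy data over three binary variables; compare the structure X0 -> X2 <- X1 with the empty graph.
data = [(0, 1, 1), (0, 0, 0), (1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
cards = [2, 2, 2]
print(qnml_score(data, {0: [], 1: [], 2: [0, 1]}, cards))
print(qnml_score(data, {0: [], 1: [], 2: []}, cards))
```

Note that each term depends only on the column sets {i} ∪ G_i and G_i, which is exactly what the covered-arc argument of Theorem 1 below exploits.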

We are now ready to prove some important properties of the qNML score.

3.1 qNML Is Score Equivalent

qNML yields equal scores for network structures that encode the same set of independencies. Verma and Pearl [Verma and Pearl, 1991] showed that the equivalent networks are exactly those which a) are the same when directed arcs are substituted by undirected ones and b) have the same V-structures, i.e., the variable triplets (A, B, C) where both A and B are parents of C, but there is no arc between A and B (in either direction). Later, Chickering [Chickering, 1995] showed that all the equivalent network structures, and only those structures, can be reached from each other by reversing, one by one, the so-called covered arcs, i.e., the arcs from node A to B for which B's parents other than A are exactly A's parents (G_B = {A} ∪ G_A).

We will next state this as a theorem and sketch a proof for it. A more detailed proof appears in Appendix A in the Supplementary Material.

Theorem 1. Let G and G' be two Bayesian network structures that differ only by a single covered arc reversal, i.e., the arc from A to B in G has been reversed in G' to point from B to A. Then

    s^{qNML}(D; G) = s^{qNML}(D; G').

Proof. The scores for the two structures can be decomposed as s^{qNML}(D; G) = \sum_{i=1}^{n} s_i^{qNML}(D; G) and s^{qNML}(D; G') = \sum_{i=1}^{n} s_i^{qNML}(D; G'). Since only the terms corresponding to the variables A and B in these sums differ, it is enough to show that the sum of these two terms is equal for G and G'. Since we can assume the data to be fixed, we lighten the notation and write P^1_{NML}(i, G_i) := P^1_{NML}(D_{i,G_i}; G) and P^1_{NML}(G_i) := P^1_{NML}(D_{G_i}; G). Now

    s_A^{qNML}(D; G) + s_B^{qNML}(D; G)
        = \log \frac{P^{1}_{NML}(A, G_A)}{P^{1}_{NML}(G_A)} \cdot \frac{P^{1}_{NML}(B, G_B)}{P^{1}_{NML}(G_B)}
        = \log \frac{P^{1}_{NML}(B, G_B)}{P^{1}_{NML}(G_A)}
        = \log \frac{P^{1}_{NML}(A, G'_A)}{P^{1}_{NML}(G'_A)} \cdot \frac{P^{1}_{NML}(B, G'_B)}{P^{1}_{NML}(G'_B)}
        = s_A^{qNML}(D; G') + s_B^{qNML}(D; G'),

using the equations {A} ∪ G_A = G_B, {B} ∪ G'_B = G'_A, {B} ∪ G_B = {A} ∪ G'_A, and G_A = G'_B, which follow easily from the definition of covered arcs.

3.2 qNML is Consistent

One important property possessed by nearly every model selection criterion is consistency. In our context, consistency means that given a data matrix with N samples coming from a distribution faithful to some DAG G, qNML will give the highest score to the true graph G with a probability tending to one as N increases. We will show this by first proving that qNML is asymptotically equivalent to the widely used BIC criterion, which is known to be consistent [Schwarz, 1978, Haughton, 1988]. The outline of this proof follows a similar pattern to that in [Silander et al., 2010] where the consistency of fNML was proved.

The BIC criterion can be written as

    \mathrm{BIC}(D; G) = \sum_{i=1}^{n} \left[ \log P(D_i \mid \hat{\theta}_{i|G_i}) - \frac{q_i(r_i - 1)}{2} \log N \right],    (9)

where \hat{\theta}_{i|G_i} denotes the maximum likelihood parameters of the conditional distribution of variable i given its parents in G. Since both the BIC and qNML scores are decomposable, we can focus on studying the local scores. We will next show that, asymptotically, the local qNML score equals the local BIC score. This is formulated in the following theorem:

Theorem 2. Let r_i and q_i denote the number of possible values for variable X_i and the number of possible configurations of its parents G_i, respectively. As N → ∞,

    s_i^{qNML}(D; G) = \log P(D_i \mid \hat{\theta}_{i|G_i}) - \frac{q_i(r_i - 1)}{2} \log N.

In order to prove this, we start with the definition of qNML and write

    s_i^{qNML}(D; G) = \log \frac{P(D_{i,G_i} \mid \hat{\theta}_{i,G_i})}{P(D_{G_i} \mid \hat{\theta}_{G_i})} - \left( \mathrm{reg}(N, q_i r_i) - \mathrm{reg}(N, q_i) \right).    (10)

By comparing equations (9) and (10), we see that proving our claim boils down to showing two things: 1) the terms involving the maximized likelihoods are equal, and 2) the penalty terms are asymptotically equivalent. We will formulate these as two lemmas.

Lemma 1. The maximized likelihood terms in equations (9) and (10) are equal:

    \frac{P(D_{i,G_i} \mid \hat{\theta}_{i,G_i})}{P(D_{G_i} \mid \hat{\theta}_{G_i})} = P(D_i \mid \hat{\theta}_{i|G_i}).

Proof. We can write the terms on the left side of the equation as

    P(D_{i,G_i} \mid \hat{\theta}_{i,G_i}) = \prod_{j,k} \left( \frac{N_{ijk}}{N} \right)^{N_{ijk}}, \quad \text{and} \quad P(D_{G_i} \mid \hat{\theta}_{G_i}) = \prod_{j} \left( \frac{N_{ij}}{N} \right)^{N_{ij}}.

Here, N_{ijk} denotes the number of times we observe X_i taking value k when its parents are in the jth configuration in our data matrix D. Also, N_{ij} = \sum_k N_{ijk} (and \sum_{j,k} N_{ijk} = N for all i). Therefore,

    \frac{P(D_{i,G_i} \mid \hat{\theta}_{i,G_i})}{P(D_{G_i} \mid \hat{\theta}_{G_i})} = \frac{\prod_{j,k} (N_{ijk}/N)^{N_{ijk}}}{\prod_{j} (N_{ij}/N)^{N_{ij}}} = \prod_{j} \prod_{k} \left( \frac{N_{ijk}}{N_{ij}} \right)^{N_{ijk}} = P(D_i \mid \hat{\theta}_{i|G_i}).
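A quick numerical sanity check of Lemma 1 on an invented toy column (our own code, not part of the paper's proof):

```python
from collections import Counter
from math import log, isclose

# Toy column X_i (r_i = 3) and its two binary parent columns; all values are invented.
xi      = [0, 1, 1, 0, 1, 1, 0, 2]
parents = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1), (1, 1)]
N = len(xi)

joint  = Counter(zip(parents, xi))     # counts N_ijk of (parent configuration j, value k)
parent = Counter(parents)              # counts N_ij of parent configuration j

# Left side of Lemma 1: log P(D_{i,G_i} | ML) - log P(D_{G_i} | ML),
# with both column sets treated as single collapsed categorical variables.
lhs = (sum(n * log(n / N) for n in joint.values())
       - sum(n * log(n / N) for n in parent.values()))

# Right side: log P(D_i | ML of the conditional distribution) = sum_{j,k} N_ijk * log(N_ijk / N_ij).
rhs = sum(n * log(n / parent[j]) for (j, _), n in joint.items())

print(lhs, rhs, isclose(lhs, rhs))
```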

Next, we consider the difference of regrets in (10), which corresponds to the penalty term of BIC. The following lemma states that these two are asymptotically equal:

Lemma 2. As N → ∞,

    \mathrm{reg}(N, q_i r_i) - \mathrm{reg}(N, q_i) = \frac{q_i(r_i - 1)}{2} \log N + O(1).

Proof. The regret for a single multinomial variable with m categories can be written asymptotically as

    \mathrm{reg}(N, m) = \frac{m - 1}{2} \log N + O(1).    (11)

For the more precise statement with the underlying assumptions (which are fulfilled in the multinomial case) and for the proof, we refer to [Rissanen, 1996, Grünwald, 2007]. Using this, we have

    \mathrm{reg}(N, q_i r_i) - \mathrm{reg}(N, q_i) = \frac{q_i r_i - 1}{2} \log N - \frac{q_i - 1}{2} \log N + O(1) = \frac{q_i r_i - 1 - q_i + 1}{2} \log N + O(1) = \frac{q_i(r_i - 1)}{2} \log N + O(1).

This concludes our proof since Lemmas 1 and 2 imply Theorem 2.
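The rate in Lemma 2 can also be eyeballed numerically. The sketch below (our own code) uses the approximation (7) for the regret and prints the ratio of the regret difference to (q_i(r_i - 1)/2) log N; because of the O(1) term the ratio approaches one only slowly as N grows:

```python
from math import log, sqrt

def regret(N, m):
    """Approximation (7) of reg(N, m)."""
    a = m / N
    Ca = 0.5 + 0.5 * sqrt(1 + 4 / a)
    return N * (log(a) + (a + 2) * log(Ca) - 1 / Ca) - 0.5 * log(Ca + 2 / a)

q_i, r_i = 2, 3                       # a child with r_i = 3 values and q_i = 2 parent configurations
for N in [10**2, 10**4, 10**6, 10**8]:
    diff = regret(N, q_i * r_i) - regret(N, q_i)
    bic_penalty = q_i * (r_i - 1) / 2 * log(N)
    print(N, round(diff, 2), round(bic_penalty, 2), round(diff / bic_penalty, 3))
```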

3.3 qNML Equals NML for Many Models

The fNML criterion can be seen as a computationally feasible approximation of the more desirable NML criterion. However, the fNML criterion equals the NML criterion only for the Bayesian network structure with no arcs. It can be shown that the qNML criterion equals the NML criterion for all the networks G whose connected components are tournaments (i.e., complete directed acyclic subgraphs of G). These networks include the empty network, the fully connected one, and many networks in between having different complexity. While the generating network is unlikely to be composed of tournament components, the result increases the plausibility that qNML is a reasonable approximation for NML in general (a claim that is naturally subject for further study).

Theorem 3. If G consists of C connected components (G^1, ..., G^C) with variable sets (V^1, ..., V^C), then log P_NML(D; G) = s^{qNML}(D; G) for all data sets D.

Proof. The full proof can be found in Appendix C in the Supplementary Material. The proof first shows that NML decomposes for these particular structures, so it is enough to show the equivalence for fully connected graphs. It further derives the number a(n) of different n-node networks whose connected components are tournaments, which turns out to be the formula for OEIS sequence A000262 (https://oeis.org/A000262). In general this sequence grows rapidly: 1, 1, 3, 13, 73, 501, 4051, 37633, 394353, 4596553, ...
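For reference, these counts can be generated with a simple two-term recurrence that reproduces the terms listed above (a small sketch of our own, with the recurrence a(n) = (2n-1)a(n-1) - (n-1)(n-2)a(n-2)):

```python
def a000262(n_max):
    """Terms a(0..n_max) of the sequence via a(n) = (2n-1)a(n-1) - (n-1)(n-2)a(n-2), a(0)=a(1)=1."""
    a = [1, 1]
    for n in range(2, n_max + 1):
        a.append((2 * n - 1) * a[n - 1] - (n - 1) * (n - 2) * a[n - 2])
    return a[:n_max + 1]

print(a000262(9))   # [1, 1, 3, 13, 73, 501, 4051, 37633, 394353, 4596553]
```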
3.4 qNML is Regular

Suzuki [Suzuki, 2017] defines regularity for a scoring function Q_N(X | Y) as follows:

Definition 1. Assume H_N(X | Y) ≤ H_N(X | Y'), where Y' ⊂ Y. We say that Q_N(· | ·) is regular if Q_N(X | Y') ≥ Q_N(X | Y).

In the definition, N denotes the sample size, X is one of the variables, Y denotes the proposed parent set for X, and H_N(· | ·) refers to the empirical conditional entropy. Suzuki [Suzuki, 2017] shows that BDeu violates this principle and demonstrates that this can cause the score to prefer more complex networks even though the data do not support this. Regular scores are also argued to be computationally more efficient when applied with branch-and-bound type algorithms for Bayesian network structure learning [Suzuki and Kawahara, 2017].

By analyzing the penalty term of the qNML scoring function, one can prove the following statement:

Theorem 4. The qNML score is regular.

Proof. The proof is given in Appendix B in the Supplementary Material.

As the fNML criterion differs from qNML only by how the penalty term is defined, we obtain the following result with little extra work:

Theorem 5. The fNML score is regular.

Proof. The proof is given in Appendix B in the Supplementary Material.

Suzuki [Suzuki, 2017] independently introduces a Bayesian Dirichlet quotient (BDq) score that can also be shown to be score equivalent and regular. However, like BDeu, this score features a single hyperparameter α, and our initial studies suggest that BDq is also very sensitive to this hyperparameter (see Appendix D in the Supplementary Material), an issue that was one of the main motivations to develop a parameter-free model selection criterion like qNML.

4 EXPERIMENTAL RESULTS

We empirically compare the capacity of qNML to that of BIC, BDeu (α = 1) and fNML in identifying the data generating structures and in producing models that are predictive and parsimonious. It seems that none of the criteria uniformly outperforms the others in all these desirable aspects of model selection criteria.

4.1 Finding Generating Structure

In our first experiment, we took five widely used benchmark Bayesian networks (from the Bayesian Network Repository: http://www.bnlearn.com/bnrepository/), sampled data from them, and tried to learn the generating structure with the different scoring functions using various sample sizes. We used the following networks: Asia (n = 8, 8 arcs), Cancer (n = 5, 4 arcs), Earthquake (n = 5, 4 arcs), Sachs (n = 11, 17 arcs) and Survey (n = 6, 6 arcs). These networks were picked in order to use the dynamic programming based exact structure learning [Silander and Myllymäki, 2006], which limited the number n of variables to less than 20. We measured the quality of the learned structures using structural Hamming distance (SHD) [Tsamardinos et al., 2006].
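For orientation, here is a deliberately simplified stand-in for SHD (our own code; the paper uses the CPDAG-based SHD of Tsamardinos et al., which additionally accounts for equivalence classes). It counts missing edges, extra edges, and wrongly oriented edges between two DAGs given as sets of directed arcs:

```python
def shd_simplified(edges_true, edges_learned):
    """Rough structural Hamming distance between two DAGs given as sets of directed edges:
    missing or extra adjacencies plus wrongly oriented edges (a reversed edge counts once)."""
    skel_true = {frozenset(e) for e in edges_true}
    skel_learned = {frozenset(e) for e in edges_learned}
    missing_or_extra = len(skel_true ^ skel_learned)
    reversed_edges = sum(1 for (a, b) in edges_learned
                         if (b, a) in edges_true and (a, b) not in edges_true)
    return missing_or_extra + reversed_edges

print(shd_simplified({("A", "B"), ("B", "C")}, {("A", "B"), ("C", "B"), ("A", "C")}))  # 2
```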

[Figure 2: Sample size versus SHD with data generated from real-world DAGs; one panel per network (Survey, Earthquake, Sachs, Cancer, Asia), with curves for BIC, BDeu, fNML and qNML over sample sizes from 10^1 to 10^4.]

Figure 2 shows SHDs for all the scoring criteria for each network. Sample sizes range from 10 to 10000 and the shown results are averages computed from 1000 repetitions. None of the scores dominates in all settings considered. BIC fares well when the sample size is small as it tends to produce a nearly empty graph, which is a good answer in terms of SHD when the generating networks are relatively sparse. qNML obtains strong results in the Earthquake and Asia networks, being the best or the second best with all the sample sizes considered.

Figure 3 summarizes the SHD results in all networks by showing the average rank for each score. The ranking was done by giving the score with the lowest SHD rank 1 and the worst one rank 4. In case of ties, the methods with the same SHD were given the same rank. The shown results are averages computed from 5000 values (5 networks, 1000 repetitions). From this, we can see that qNML never has the worst average ranking, and it has the best ranking with sample sizes greater than 300. This suggests that qNML is overall a safe choice in structure learning, especially with moderate and large sample sizes.

[Figure 3: Average ranks for the scoring functions in structure learning experiments; average rank (1 = best, 4 = worst) as a function of sample size for BIC, BDeu, fNML and qNML.]

4.2 Prediction and Parsimony

To empirically compare the model selection criteria, we took 20 UCI data sets [Lichman, 2013] and ran train and test experiments for all of them. To better compare the performance over different sample sizes, we picked different fractions (10%, 20%, ..., 90%) of the data sets as training data and the rest as the test data. This was done for 1000 different permutations of each data set. The training was conducted using the dynamic programming based exact structure learning algorithm.

When predicting P(d_test | D_train, S, θ) with structures S learned by the BDeu score, we used the Bayesian predictive parameter values (BPP) θ_ijk ∝ N_ijk + 1/(r_i q_i). In the spirit of keeping the scores hyperparameter-free, for structures learned by the other model selection criteria, we used the sequential predictive NML (sNML) parametrization θ_ijk ∝ e(N_ijk)(N_ijk + 1), where e(n) = ((n+1)/n)^n, as suggested in [Rissanen and Roos, 2007].
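A small sketch of the two predictive parametrizations (our own code; the proportionality constants are normalized over k, and e(0) is taken to be 1, its limiting value, since the formula is not defined at n = 0):

```python
def bpp_params(counts, r_i, q_i):
    """BPP: theta_ijk proportional to N_ijk + 1/(r_i * q_i), normalized over k."""
    weights = [n + 1.0 / (r_i * q_i) for n in counts]
    total = sum(weights)
    return [w / total for w in weights]

def snml_params(counts):
    """sNML: theta_ijk proportional to e(N_ijk) * (N_ijk + 1) with e(n) = ((n+1)/n)^n,
    using the convention e(0) = 1, normalized over k."""
    def e(n):
        return 1.0 if n == 0 else ((n + 1) / n) ** n
    weights = [e(n) * (n + 1) for n in counts]
    total = sum(weights)
    return [w / total for w in weights]

# Toy counts N_ijk for one parent configuration j of a variable with r_i = 3 values.
counts = [5, 1, 0]
print(bpp_params(counts, r_i=3, q_i=2))
print(snml_params(counts))
```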

For each train/test sample, we ranked the predictive performance of the models learned by the four different scores (rank 1 being the best and 4 the worst). Table 2 features the average rank for different data sets, the average being taken over 1000 different train/test samples for each of the 9 sample sizes.

Table 2: Average predictive performance rank over different sample sizes for different model selection criteria in 20 different data sets.

    Data      N        BDeu (BPP)   BIC (sNML)   fNML (sNML)   qNML (sNML)
    PostOpe   90           2.79         1.20          3.06          2.94
    Iris      150          2.82         2.37          2.27          2.54
    Wine      178          3.23         1.88          2.67          2.22
    Glass     214          3.61         3.09          1.42          1.88
    Thyroid   215          2.55         3.21          1.80          2.44
    HeartSt   270          3.12         1.39          3.12          2.37
    BreastC   286          3.09         1.41          2.97          2.53
    HeartHu   294          3.18         1.66          2.90          2.27
    HeartCl   303          3.46         1.38          2.99          2.17
    Ecoli     336          3.20         3.53          1.24          2.04
    Liver     345          3.17         2.39          2.69          1.75
    Balance   625          3.35         1.91          1.59          3.16
    BcWisco   699          3.06         2.03          2.89          2.02
    Diabete   768          2.91         2.70          2.68          1.71
    TicTacT   958          3.44         2.71          1.31          2.53
    Yeast     1484         2.60         3.76          1.55          2.10
    Abalone   4177         2.60         3.64          1.04          2.72
    PageBlo   5473         2.24         3.61          1.31          2.83
    Adult     32561        3.23         3.77          1.00          2.00
    Shuttle   58000        1.44         3.78          1.56          3.22

Table 3: Average number of parameters in models for different scores in 20 different data sets.

    Data      N        BDeu     BIC    fNML    qNML
    Iris      15         37      23      33      29
    PostOpe   18       1217      19     397     146
    Ecoli     34        182      31     162      77
    Liver     35         45      15      61      24
    Wine      36      16521      70     807     205
    Glass     44       1677      48     506      97
    Thyroid   44         40      23      66      28
    HeartSt   54      16861      44    1110     256
    BreastC   58      25797      49    3767     844
    HeartHu   60       1634      43     792      90
    HeartCl   62      34381      47    1433     404
    BcWisco   70       4630      42     603      89
    Diabete   77         39      22     216      34
    TicTacT   96      13701      25    1969     767
    Balance   126        20      24      49     611
    Yeast     149        71      31     265      75
    Abalone   418        91      46     150      63
    PageBlo   548       703      45     380      56
    Shuttle   5800      535      99     717     130
    Adult     6513      699     479    1555     945

BIC's bias for simplicity makes it often achieve the best rank with small sample sizes, but for the same reason it performs worst for the larger sample sizes, while fNML seems to be good for large sample sizes. The striking feature about qNML is its robustness. It is usually between BIC and fNML for all the sample sizes, making it a "safe choice". This can be quantified if we further average the columns of Table 2, yielding the average ranks 2.95, 2.57, 2.10, and 2.37 (for BDeu, BIC, fNML, and qNML), with standard deviations 0.49, 0.90, 0.76, and 0.43. While fNML achieves on average the best rank, the runner-up qNML has the lowest standard deviation.

Figure 1 shows how fNML still sometimes behaves strangely in terms of model complexity as measured by the number of parameters in the model. qNML, instead, appears to yield more parsimonious models. To study the concern of fNML producing too complex models for small sample sizes, we studied the number of parameters in models produced by the different scores when using 10% of each data set for structure learning. Looking at the number of parameters for the same 20 data sets again features BIC's preference for simple models (Table 3). qNML usually (19/20) yields more parsimonious models than fNML, which selects the most complex model for 7 out of 20 data sets.

The graphs for different sample sizes for both predictive accuracy and the number of parameters can be found in Appendix E in the Supplementary Material.

5 CONCLUSION

We have presented qNML, a new model selection criterion for learning structures of Bayesian networks. While being competitive in predictive terms, it often yields significantly simpler models than other common model selection criteria, with the exception of BIC, which has a very strong bias for simplicity. The computational cost of qNML equals the cost of the current state-of-the-art criteria. The criterion also gives equal scores for models that encode the same independence hypotheses about the joint probability distribution. qNML also coincides with the NML criterion for many models. In our experiments, the qNML criterion appears as a safe choice for a model selection criterion that balances parsimony, predictive capability and the ability to quickly converge to the generating model.

Acknowledgements

J.L., E.J. and T.R. were supported by the Academy of Finland (COIN CoE and Project TensorML). J.L. was supported by the DoCS doctoral programme of the University of Helsinki.

References

[Akaike, 1973] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrox, B. and Caski, F., editors, Proceedings of the Second International Symposium on Information Theory, pages 267–281, Budapest. Akademiai Kiado.

[Buntine, 1991] Buntine, W. (1991). Theory refinement on Bayesian networks. In D'Ambrosio, B., Smets, P., and Bonissone, P., editors, Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52–60. Morgan Kaufmann Publishers.

[Chickering, 1995] Chickering, D. M. (1995). A transformational characterization of equivalent Bayesian network structures. In UAI '95: Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, pages 87–98. Morgan Kaufmann.

[Eggeling et al., 2014] Eggeling, R., Roos, T., Myllymäki, P., Grosse, I., et al. (2014). Robust learning of inhomogeneous PMMs. In Kaski, S. and Corander, J., editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 229–237. PMLR.

[Grünwald, 2007] Grünwald, P. (2007). The Minimum Description Length Principle. MIT Press.

[Haughton, 1988] Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16(1):342–355.

[Heckerman, 1995] Heckerman, D. (1995). A Bayesian approach to learning causal networks. In Besnard, P. and Hanks, S., editors, Uncertainty in Artificial Intelligence: Proceedings of the Eleventh Conference, pages 285–295, Montreal, Canada.

[Heckerman et al., 1995] Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243.

[Ide et al., 2014] Ide, J. S., Zhang, S., and Li, C. S. (2014). Bayesian network models in brain functional connectivity analysis. International Journal of Approximate Reasoning, 56(1 Pt 1).

[Kontkanen et al., 2003] Kontkanen, P., Buntine, W., Myllymäki, P., Rissanen, J., and Tirri, H. (2003). Efficient computation of stochastic complexity. In Bishop, C. and Frey, B., editors, Proceedings of the Ninth International Conference on Artificial Intelligence and Statistics, pages 233–238. Society for Artificial Intelligence and Statistics.

[Lichman, 2013] Lichman, M. (2013). UCI machine learning repository.

[Liu et al., 2012] Liu, Z., Malone, B., and Yuan, C. (2012). Empirical evaluation of scoring functions for Bayesian network model selection. BMC Bioinformatics, 13(15):S14.

[Määttä et al., 2016] Määttä, J., Schmidt, D. F., and Roos, T. (2016). Subset selection in linear regression using sequentially normalized least squares: Asymptotic theory. Scandinavian Journal of Statistics, 43(2):382–395.

[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, CA.

[Rissanen, 1978] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14:445–471.

[Rissanen, 1996] Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47.

[Rissanen and Roos, 2007] Rissanen, J. and Roos, T. (2007). Conditional NML models. In Proceedings of the Information Theory and Applications Workshop (ITA-07), San Diego, CA.

[Sachs et al., 2002] Sachs, K., Gifford, D., Jaakkola, T., Sorger, P., and Lauffenburger, D. A. (2002). Bayesian network approach to cell signaling pathway modeling. Science's STKE, 2002(148):38–38.

[Savickas and Vasilecas, 2014] Savickas, T. and Vasilecas, O. (2014). Bayesian belief network application in process mining. In Proceedings of the 15th International Conference on Computer Systems and Technologies, CompSysTech '14, pages 226–233, New York, NY, USA. ACM.

[Schwarz, 1978] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.

[Shtarkov, 1987] Shtarkov, Y. (1987). Universal sequential coding of single messages. Problems of Information Transmission, 23:3–17.

[Silander et al., 2007] Silander, T., Kontkanen, P., and Myllymäki, P. (2007). On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In Parr, R. and van der Gaag, L., editors, Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI-07), pages 360–367. AUAI Press.

[Silander and Myllymäki, 2006] Silander, T. and Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. In Dechter, R. and Richardson, T., editors, Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 445–452. AUAI Press.

[Silander et al., 2008] Silander, T., Roos, T., Kontkanen, P., and Myllymäki, P. (2008). Factorized normalized maximum likelihood criterion for learning Bayesian network structures. In Proceedings of the 4th European Workshop on Probabilistic Graphical Models (PGM-08), pages 257–264, Hirtshals, Denmark.

[Silander et al., 2010] Silander, T., Roos, T., and Myllymäki, P. (2010). Learning locally minimax optimal Bayesian networks. International Journal of Approximate Reasoning, 51(5):544–557.

[Steck, 2008] Steck, H. (2008). Learning the Bayesian network structure: Dirichlet prior vs data. In McAllester, D. A. and Myllymäki, P., editors, Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 511–518. AUAI Press.

[Suzuki, 2017] Suzuki, J. (2017). A theoretical analysis of the BDeu scores in Bayesian network structure learning. Behaviormetrika, 44(1):97–116.

[Suzuki and Kawahara, 2017] Suzuki, J. and Kawahara, J. (2017). Branch and bound for regular Bayesian network structure learning. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017), Sydney, Australia.

[Szpankowski and Weinberger, 2012] Szpankowski, W. and Weinberger, M. J. (2012). Minimax pointwise redundancy for memoryless models over large alphabets. IEEE Transactions on Information Theory, 58(7):4094–4104.

[Tsamardinos et al., 2006] Tsamardinos, I., Brown, L. E., and Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78.

[Verma and Pearl, 1991] Verma, T. and Pearl, J. (1991). Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-90), pages 255–270, New York, NY, USA. Elsevier Science Inc.