
Quotient Normalized Maximum Likelihood Criterion for Learning Bayesian Network Structures

Tomi Silander (1), Janne Leppä-aho (2,4), Elias Jääsaari (2,3), Teemu Roos (2,4)

(1) NAVER LABS Europe; (2) Helsinki Institute for Information Technology HIIT; (3) Department of CS, Aalto University; (4) Department of CS, University of Helsinki

Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS) 2018, Lanzarote, Spain. PMLR: Volume 84. Copyright 2018 by the author(s).

Abstract

We introduce an information theoretic criterion for Bayesian network structure learning which we call quotient normalized maximum likelihood (qNML). In contrast to the closely related factorized normalized maximum likelihood criterion, qNML satisfies the property of score equivalence. It is also decomposable and completely free of adjustable hyperparameters. For practical computations, we identify a remarkably accurate approximation proposed earlier by Szpankowski and Weinberger. Experiments on both simulated and real data demonstrate that the new criterion leads to parsimonious models with good predictive accuracy.

1 INTRODUCTION

Bayesian networks [Pearl, 1988] are popular models for presenting multivariate statistical dependencies that may have been induced by underlying causal mechanisms. Techniques for learning the structure of Bayesian networks from observational data have therefore been used for many tasks such as discovering cell signaling pathways from protein activity data [Sachs et al., 2002], revealing business process structures from transaction logs [Savickas and Vasilecas, 2014], and modeling brain-region connectivity using fMRI data [Ide et al., 2014].

Learning the structure of statistical dependencies can be seen as a model selection task where each model is a different hypothesis about the conditional dependencies between sets of variables. Traditional model selection criteria such as the Akaike information criterion (AIC) [Akaike, 1973] and the Bayesian information criterion (BIC) [Schwarz, 1978] have also been used for the task, but recent comparisons have not been favorable for AIC, and BIC appears to require large sample sizes in order to identify appropriate structures [Silander et al., 2008, Liu et al., 2012]. Traditionally, the most popular criterion has been the Bayesian marginal likelihood [Heckerman, 1995] and its BDeu variant (see Section 2), but studies [Silander et al., 2007, Steck, 2008] show this criterion to be sensitive to hyperparameters and to yield undesirably complex models for small sample sizes.

The information-theoretic normalized maximum likelihood (NML) criterion [Shtarkov, 1987, Rissanen, 1996] would otherwise be a potential candidate for a good criterion, but its exact calculation is likely to be prohibitively expensive. In 2008, Silander et al. introduced a hyperparameter-free, NML-inspired criterion called the factorized NML (fNML) [Silander et al., 2008] that was shown to yield good predictive models without such sensitivity problems. However, from the structure learning point of view, fNML still sometimes appears to yield overly complex models. In this paper we introduce another NML-related criterion, the quotient NML (qNML), that yields simpler models without sacrificing predictive accuracy. Furthermore, unlike fNML, qNML is score equivalent, i.e., it gives equal scores to structures that encode the same independence and dependence statements. Like other common model selection criteria, qNML is also consistent.

We next briefly introduce Bayesian networks and review the BDeu and fNML criteria, and then introduce the qNML criterion. We also summarize the results for 20 data sets to back up our claim that qNML yields parsimonious models with good predictive capabilities.

The experiments with artificial data generated from real-world Bayesian networks demonstrate the capability of our score to quickly learn a structure close to the generating one.

2 BAYESIAN NETWORKS

Bayesian networks are a general way to describe the dependencies between the components of an n-dimensional random data vector. In this paper we only address the case in which the component X_i of the data vector X = (X_1, ..., X_n) may take any of the discrete values in a set {1, ..., r_i}. Despite denoting the values with small integers, the model will treat each X_i as a categorical variable.

[Figure 1: Number of parameters in a breast cancer (BreastC) model as a function of sample size for different model selection criteria; curves for BDeu, BIC, fNML and qNML, y-axis: average number of parameters, x-axis: sample size up to 250.]

2.1 Likelihood

A Bayesian network B = (G, θ) defines a probability distribution for X. The component G defines the structure of the model as a directed acyclic graph (DAG) that has exactly one node for each component of X. The structure G = (G_1, ..., G_n) defines for each variable/node X_i its (possibly empty) parent set G_i, i.e., the nodes from which there is a directed edge to the variable X_i.

Given a realization x of X, we denote the sub-vector of x that consists of the values of the parents of X_i in x by G_i(x). It is customary to enumerate all the possible sub-vectors G_i(x) from 1 to q_i = \prod_{h \in G_i} r_h. In case G_i is empty, we define q_i = 1 and P(G_i(x) = 1) = 1 for all vectors x.

For each variable X_i there is a q_i × r_i table θ_i of parameters whose kth column on the jth row θ_ij defines the conditional probability P(X_i = k | G_i(X) = j; θ) = θ_ijk. With structure G and parameters θ, we can now express the likelihood of the model as

    P(x \mid G, \theta) = \prod_{i=1}^{n} P(x_i \mid G_i(x); \theta_i) = \prod_{i=1}^{n} \theta_{i G_i(x) x_i}.    (1)
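To make the notation concrete, here is a minimal sketch (our own toy example; the network, parameter values and helper names are invented for illustration and are not from the paper) that evaluates equation (1) for a three-variable network in which X3 has parents X1 and X2:

```python
# Toy structure: parents of each node; X3 (index 2) has parents X1 (index 0) and X2 (index 1).
parents = {0: [], 1: [], 2: [0, 1]}
r = [2, 2, 2]                                   # number of values r_i of each variable

# Conditional probability tables: theta[i][j][k] = P(X_i = k | parent configuration j).
theta = {
    0: [[0.7, 0.3]],                            # q_0 = 1 parent configuration (no parents)
    1: [[0.4, 0.6]],                            # q_1 = 1
    2: [[0.9, 0.1], [0.5, 0.5],                 # q_2 = r_0 * r_1 = 4 parent configurations
        [0.2, 0.8], [0.1, 0.9]],
}

def parent_config(i, x):
    """Index j of the parent configuration G_i(x), enumerated in mixed-radix order."""
    j = 0
    for h in parents[i]:
        j = j * r[h] + x[h]
    return j

def likelihood(x):
    """Equation (1): P(x | G, theta) as a product of local conditional probabilities."""
    p = 1.0
    for i in range(len(x)):
        p *= theta[i][parent_config(i, x)][x[i]]
    return p

print(likelihood((0, 1, 1)))   # 0.7 * 0.6 * P(X3=1 | X1=0, X2=1) = 0.7 * 0.6 * 0.5
```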

2.2 Bayesian Structure Learning

Score-based Bayesian learning of Bayesian network structures evaluates the goodness of different structures G using their posterior probability P(G | D, α), where α denotes the hyperparameters for the model parameters θ, and D is a collection of N n-dimensional i.i.d. data vectors collected into an N × n design matrix. We use the notation D_i to denote the ith column of the data matrix and the notation D_V to denote the columns that correspond to the variable subset V. We also write D_{i,G_i} for D_{{i} ∪ G_i} and denote the entries of the column i on the rows on which the parents G_i contain the value configuration number j by D_{i,G_i=j}, j ∈ {1, ..., q_i}.

It is common to assume the uniform prior for structures, in which case the objective function for structure learning is reduced to the marginal likelihood P(D | G, α). If the model parameters θ_ij are further assumed to be independently Dirichlet distributed, depending only on i and G_i, and the data D is assumed to have no missing values, the marginal likelihood can be decomposed as

    P(D \mid G, \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} P(D_{i,G_i=j}; \alpha) = \prod_{i=1}^{n} \prod_{j=1}^{q_i} \int P(D_{i,G_i=j} \mid \theta_{ij}) P(\theta_{ij}; \alpha) \, d\theta_{ij}.    (2)

In coding terms this means that each data column D_i is first partitioned based on the values in columns G_i, and each part is then coded using a Bayesian mixture that can be expressed in closed form [Buntine, 1991, Heckerman et al., 1995].
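As an illustration of the closed form referred to above, the sketch below (our own toy example, not code from the paper) evaluates one factor P(D_{i,G_i=j}; α) of equation (2) with BDeu-style hyperparameters α/(q_i r_i), using the standard Dirichlet-multinomial marginal:

```python
from math import lgamma
from collections import Counter

def log_marginal_part(column_part, r_i, q_i, alpha=1.0):
    """log P(D_{i,G_i=j}; alpha): Dirichlet-multinomial marginal of one part of column i,
    with BDeu-style pseudo-counts alpha / (q_i * r_i) for each of the r_i values."""
    a_k = alpha / (q_i * r_i)            # per-value pseudo-count
    a_j = alpha / q_i                    # sum of the pseudo-counts over the r_i values
    counts = Counter(column_part)        # N_ijk for the observed values k
    n_j = len(column_part)               # N_ij, the size of this part
    res = lgamma(a_j) - lgamma(a_j + n_j)
    for k in range(r_i):
        res += lgamma(a_k + counts.get(k, 0)) - lgamma(a_k)
    return res

# Values of X_i (with r_i = 3) on the rows where the parents take configuration j.
part = [0, 0, 1, 0, 2, 0]
print(log_marginal_part(part, r_i=3, q_i=4, alpha=1.0))
```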
2.3 Problems, Solutions and Problems

Finding satisfactory Dirichlet hyperparameters for the Bayesian mixture above has, however, turned out to be problematic. Early on, one of the desiderata for a good model selection criterion was that it is score equivalent, i.e., that it would yield equal scores for essentially equivalent models [Verma and Pearl, 1991]. For example, the score for the structure X_1 → X_2 should be the same as the score for the model X_2 → X_1, since they both correspond to the hypothesis that variables X_1 and X_2 are statistically dependent on each other. It can be shown [Heckerman et al., 1995] that to achieve this, not all the hyperparameters α are possible, and for practical reasons Buntine [Buntine, 1991] suggested a so-called BDeu score with just one hyperparameter α ∈ R_{++} so that θ_{ij·} ∼ Dir(α/(q_i r_i), ..., α/(q_i r_i)).

However, it soon turned out that the BDeu score was very sensitive to the selection of this hyperparameter [Silander et al., 2007] and that for small sample sizes this method detects spurious correlations [Steck, 2008], leading to models with suspiciously many parameters.

Recently, Suzuki [Suzuki, 2017] discussed the theoretical properties of the BDeu score and showed that in certain settings BDeu has the tendency to add more and more parent variables for a child node even though the empirical conditional entropy of the child given the parents has already reached zero. In more detail, assume that in our data D the values of X_i are completely determined by the variables in a set Z, so that the empirical entropy H_N(X_i | Z) is zero. Now, if we can further find one or more variables, denoted by Y, whose values are determined completely by the variables in Z, then BDeu will prefer the set Z ∪ Y over Z alone as the parents of X_i. Suzuki argues that this kind of behavior violates regularity in model selection, as the more complex model is preferred over a simpler one even though it does not fit the data any better.

The phenomenon seems to stem from the way the hyperparameters for the Dirichlet distributions are chosen in BDeu, since using Jeffreys' prior, θ_{ijk} ∼ Dir(1/2, ..., 1/2), does not suffer from this anomaly. However, using Jeffreys' prior causes the marginal likelihood score not to be score equivalent. In Section 3.4, we will give the formal definition of regularity and state that qNML is regular. In addition, we provide a proof of regularity for the fNML criterion, which has not appeared in the literature before. The detailed proofs can be found in Appendix B in the Supplementary Material.

A natural solution to avoid the parameter sensitivity of BDeu would be to use a normalized maximum likelihood (NML) criterion [Shtarkov, 1987, Rissanen, 1996], i.e., to find the structure G that maximizes

    P_{NML}(D; G) = \frac{P(D \mid \hat{\theta}(D; G))}{\sum_{D'} P(D' \mid \hat{\theta}(D'; G))},    (3)

where \hat{\theta} denotes the (easy to find) maximum likelihood parameters and the sum in the denominator goes over all the possible N × n data matrices. This information-theoretic NML criterion can be justified from the minimum description length point of view [Rissanen, 1978, Grünwald, 2007]. It has been shown to be robust with respect to different data generating mechanisms where a good choice of prior is challenging, see [Eggeling et al., 2014, Määttä et al., 2016]. While it is easy to see that the NML criterion satisfies the requirement of giving equal scores to equal structures, the normalizing constant renders the computation infeasible.

Consequently, Silander et al. [Silander et al., 2008] suggested solving the BDeu parameter sensitivity problem by using the NML code for the column partitions, i.e., changing the Bayesian mixture in equation (2) to

    P^{1}_{NML}(D_{i,G_i=j}; G) = \frac{P(D_{i,G_i=j} \mid \hat{\theta}(D_{i,G_i=j}; G))}{\sum_{D'} P(D' \mid \hat{\theta}(D'; G))},    (4)

where D' ∈ {1, ..., r_i}^{|D_{i,G_i=j}|}. The logarithm of the denominator is often called the regret, since it indicates the extra code length needed compared to the code length obtained using the (a priori unknown) maximum likelihood parameters. The regret for P^1_NML depends only on the length N of the categorical data vector with r different categorical values,

    \mathrm{reg}(N, r) = \log \sum_{D \in \{1, \ldots, r\}^{N}} P(D \mid \hat{\theta}(D)).    (5)

While the naive computation of the regret is still prohibitive, Silander et al. approximate it efficiently using a so-called Szpankowski approximation [Kontkanen et al., 2003]:

    \mathrm{reg}(N, r) \approx \frac{\sqrt{2}\, r\, \Gamma(\frac{r}{2})}{3 \sqrt{N}\, \Gamma(\frac{r-1}{2})} + \frac{r-1}{2} \log \frac{N}{2} - \log \Gamma\!\left(\frac{r}{2}\right) + \frac{1}{2} \log \pi - \frac{r^{2}\, \Gamma^{2}(\frac{r}{2})}{9 N\, \Gamma^{2}(\frac{r-1}{2})} + \frac{2r^{3} - 3r^{2} - 2r + 3}{36 N}.    (6)

However, equation (6) is derived only for the case where r is constant and N grows. While with fNML it is typical that N is large compared to r, an approximation for all ranges of N and r, derived by Szpankowski and Weinberger [Szpankowski and Weinberger, 2012], can also be used:

    \mathrm{reg}(N, r) \approx N \left( \log \alpha + (\alpha + 2) \log C_{\alpha} - \frac{1}{C_{\alpha}} \right) - \frac{1}{2} \log \left( C_{\alpha} + \frac{2}{\alpha} \right),    (7)

where α = r/N and C_α = \frac{1}{2} + \frac{1}{2}\sqrt{1 + \frac{4}{\alpha}}. These approximations are compared in Table 1 to the exact regret for various values of N and r. For a constant N, equation (6) provides a progressively worse approximation as r grows. Equation (7), on the other hand, is a good approximation of the regret regardless of the ratio of N and r. In our experiments, we will use this approximation for the implementation of the qNML criterion.
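As a rough illustration (our own code, not from the paper), the following sketch implements the two approximations (6) and (7) and checks them against a brute-force evaluation of equation (5), which is feasible only for very small N and r:

```python
from math import lgamma, log, sqrt, exp, pi

def regret_exact(N, r):
    """Equation (5) by brute force: sum the maximized likelihood over all count vectors."""
    def compositions(n, parts):
        if parts == 1:
            yield (n,)
            return
        for first in range(n + 1):
            for rest in compositions(n - first, parts - 1):
                yield (first,) + rest
    total = 0.0
    for counts in compositions(N, r):
        log_term = lgamma(N + 1) - sum(lgamma(c + 1) for c in counts)   # multinomial coefficient
        log_term += sum(c * log(c / N) for c in counts if c > 0)        # maximized likelihood
        total += exp(log_term)
    return log(total)

def regret_szpankowski(N, r):
    """Equation (6): the approximation used by Kontkanen et al. (2003), meant for r << N."""
    g = exp(lgamma(r / 2) - lgamma((r - 1) / 2))     # Gamma(r/2) / Gamma((r-1)/2)
    return (sqrt(2) * r * g / (3 * sqrt(N))
            + (r - 1) / 2 * log(N / 2) - lgamma(r / 2) + 0.5 * log(pi)
            - r ** 2 * g ** 2 / (9 * N)
            + (2 * r ** 3 - 3 * r ** 2 - 2 * r + 3) / (36 * N))

def regret_sw(N, r):
    """Equation (7): the Szpankowski-Weinberger approximation, usable for any ratio r/N."""
    a = r / N
    Ca = 0.5 + 0.5 * sqrt(1 + 4 / a)
    return N * (log(a) + (a + 2) * log(Ca) - 1 / Ca) - 0.5 * log(Ca + 2 / a)

print(regret_exact(20, 4), regret_szpankowski(20, 4), regret_sw(20, 4))
```

For realistic N and r only the approximations are practical; the brute force is included here merely as a check.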

Table 1: Regret values for various values of N and r.

    N       r        eq. (6)     eq. (7)      exact
    50      10         13.24       13.26      13.24
    50      100        62.00       60.01      60.00
    50      1000      491.63      153.28     153.28
    50      10000   25635.15      265.28     265.28
    500     10         22.67       22.69      22.67
    500     100       144.10      144.03     144.03
    500     1000      624.35      603.93     603.93
    500     10000    4927.24     1533.38    1533.38
    5000    10         32.74       32.76      32.74
    5000    100       247.97      247.97     247.97
    5000    1000     1452.51     1451.78    1451.78
    5000    10000    6247.83     6043.16    6043.16

fNML solves the parameter sensitivity problem and yields predictive models superior to BDeu. However, the criterion does not satisfy the property of giving the same score for models that correspond to the same dependence statements. The score equivalence is usually viewed desirable when DAGs are considered only as models for statistical dependencies, without any causal interpretation. Furthermore, the learned structures are often rather complex (see Figure 1), which also hampers their interpretation. The quest for a model selection criterion that would yield more parsimonious, easier to interpret, but still predictive Bayesian network structures is one of the main motivations for this work.

3 QUOTIENT NML SCORE

We will now introduce a quotient normalized maximum likelihood (qNML) criterion for learning Bayesian network structures. While equally efficient to compute as BDeu and fNML, it is free from hyperparameters, and it can be proven to give equal scores to equivalent models. Furthermore, it coincides with the actual NML score for exponentially many models. In our empirical tests it produces models featuring good predictive performance with significantly simpler structures than BDeu and fNML.

Like BDeu and fNML, qNML can be expressed as a product of n terms, one for each variable, but unlike the other two, it is not based on further partitioning the corresponding data column:

    s^{qNML}(D; G) := \sum_{i=1}^{n} s_i^{qNML}(D; G), \qquad s_i^{qNML}(D; G) := \log \frac{P^{1}_{NML}(D_{i,G_i}; G)}{P^{1}_{NML}(D_{G_i}; G)}.    (8)

The trick here is to model a subset of columns as though there were no conditional independencies among the corresponding variables S ⊂ X. In this case, we can collapse the \prod_{X_i \in S} r_i value configurations and consider them as values of a single variable with \prod_{X_i \in S} r_i different values, which can then be modeled with a one-dimensional P^1_NML code. The s^{qNML} score does not necessarily define a distribution for D, but it is easy to verify that it coincides with the NML score for all networks that are composed of fully connected components. The number of such networks is lower bounded by the number of nonempty partitions of a set of n elements, i.e., the nth Bell number.
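The following sketch (our own illustration; helper names and the toy data are invented) computes s^{qNML} directly from equation (8), treating each column set as a single collapsed categorical variable and using the approximation (7) for the regret (with reg(N, 1) set to its exact value, zero):

```python
from collections import Counter
from math import log, sqrt

def regret(N, m):
    """Approximation (7) of the multinomial regret reg(N, m); exactly zero for a one-value alphabet."""
    if m == 1 or N == 0:
        return 0.0
    a = m / N
    Ca = 0.5 + 0.5 * sqrt(1 + 4 / a)
    return N * (log(a) + (a + 2) * log(Ca) - 1 / Ca) - 0.5 * log(Ca + 2 / a)

def log_pnml1(data, cols, cards):
    """log P^1_NML of the columns `cols`, collapsed into one categorical variable."""
    N = len(data)
    counts = Counter(tuple(row[c] for c in cols) for row in data)
    log_ml = sum(n * log(n / N) for n in counts.values())   # maximized likelihood
    m = 1
    for c in cols:
        m *= cards[c]                                        # number of joint value configurations
    return log_ml - regret(N, m)

def qnml_score(data, parents, cards):
    """Equation (8): sum_i [ log P^1_NML(D_{i,G_i}) - log P^1_NML(D_{G_i}) ]."""
    return sum(log_pnml1(data, [i] + parents[i], cards) - log_pnml1(data, parents[i], cards)
               for i in range(len(cards)))

# Toy data over three binary variables; compare the structure X0 -> X2 <- X1 with the empty graph.
data = [(0, 1, 1), (0, 0, 0), (1, 1, 1), (1, 0, 1), (0, 1, 1), (1, 1, 0)]
cards = [2, 2, 2]
print(qnml_score(data, {0: [], 1: [], 2: [0, 1]}, cards))
print(qnml_score(data, {0: [], 1: [], 2: []}, cards))
```

Note that each term depends only on the column sets {i} ∪ G_i and G_i, which is exactly what the covered-arc argument of Theorem 1 below exploits.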

We are now ready to prove some important properties of the qNML score.

3.1 qNML Is Score Equivalent

qNML yields equal scores for network structures that encode the same set of independencies. Verma and Pearl [Verma and Pearl, 1991] showed that the equivalent networks are exactly those which a) are the same when directed arcs are substituted by undirected ones and b) have the same V-structures, i.e., the variable triplets (A, B, C) where both A and B are parents of C, but there is no arc between A and B (in either direction). Later, Chickering [Chickering, 1995] showed that all the equivalent network structures, and only those structures, can be reached from each other by reversing, one by one, the so-called covered arcs, i.e., the arcs from node A to B for which B's parents other than A are exactly A's parents (G_B = {A} ∪ G_A).

We will next state this as a theorem and sketch a proof for it. A more detailed proof appears in Appendix A in the Supplementary Material.

Theorem 1. Let G and G' be two Bayesian network structures that differ only by a single covered arc reversal, i.e., the arc from A to B in G has been reversed in G' to point from B to A. Then

    s^{qNML}(D; G) = s^{qNML}(D; G').

Proof. The scores for the two structures can be decomposed as s^{qNML}(D; G) = \sum_{i=1}^{n} s_i^{qNML}(D; G) and s^{qNML}(D; G') = \sum_{i=1}^{n} s_i^{qNML}(D; G'). Since only the terms corresponding to the variables A and B in these sums differ, it is enough to show that the sum of these two terms is equal for G and G'. Since we can assume the data to be fixed, we lighten the notation and write P^1_{NML}(i, G_i) := P^1_{NML}(D_{i,G_i}; G) and P^1_{NML}(G_i) := P^1_{NML}(D_{G_i}; G). Now

    s_A^{qNML}(D; G) + s_B^{qNML}(D; G)
        = \log \frac{P^{1}_{NML}(A, G_A)}{P^{1}_{NML}(G_A)} \cdot \frac{P^{1}_{NML}(B, G_B)}{P^{1}_{NML}(G_B)}
        = \log \frac{P^{1}_{NML}(B, G_B)}{P^{1}_{NML}(G_A)}
        = \log \frac{P^{1}_{NML}(A, G'_A)}{P^{1}_{NML}(G'_A)} \cdot \frac{P^{1}_{NML}(B, G'_B)}{P^{1}_{NML}(G'_B)}
        = s_A^{qNML}(D; G') + s_B^{qNML}(D; G'),

using the equations {A} ∪ G_A = G_B, {B} ∪ G'_B = G'_A, {B} ∪ G_B = {A} ∪ G'_A, and G_A = G'_B, which follow easily from the definition of covered arcs.

3.2 qNML is Consistent

One important property possessed by nearly every model selection criterion is consistency. In our context, consistency means that given a data matrix with N samples coming from a distribution faithful to some DAG G, qNML will give the highest score to the true graph G with a probability tending to one as N increases. We will show this by first proving that qNML is asymptotically equivalent to the widely used BIC criterion, which is known to be consistent [Schwarz, 1978, Haughton, 1988]. The outline of this proof follows a similar pattern to that in [Silander et al., 2010] where the consistency of fNML was proved.

The BIC criterion can be written as

    \mathrm{BIC}(D; G) = \sum_{i=1}^{n} \left[ \log P(D_i \mid \hat{\theta}_{i|G_i}) - \frac{q_i(r_i - 1)}{2} \log N \right],    (9)

where \hat{\theta}_{i|G_i} denotes the maximum likelihood parameters of the conditional distribution of variable i given its parents in G. Since both the BIC and qNML scores are decomposable, we can focus on studying the local scores. We will next show that, asymptotically, the local qNML score equals the local BIC score. This is formulated in the following theorem:

Theorem 2. Let r_i and q_i denote the number of possible values for variable X_i and the number of possible configurations of its parents G_i, respectively. As N → ∞,

    s_i^{qNML}(D; G) = \log P(D_i \mid \hat{\theta}_{i|G_i}) - \frac{q_i(r_i - 1)}{2} \log N.

In order to prove this, we start with the definition of qNML and write

    s_i^{qNML}(D; G) = \log \frac{P(D_{i,G_i} \mid \hat{\theta}_{i,G_i})}{P(D_{G_i} \mid \hat{\theta}_{G_i})} - \left( \mathrm{reg}(N, q_i r_i) - \mathrm{reg}(N, q_i) \right).    (10)

By comparing equations (9) and (10), we see that proving our claim boils down to showing two things: 1) the terms involving the maximized likelihoods are equal, and 2) the penalty terms are asymptotically equivalent. We will formulate these as two lemmas.

Lemma 1. The maximized likelihood terms in equations (9) and (10) are equal:

    \frac{P(D_{i,G_i} \mid \hat{\theta}_{i,G_i})}{P(D_{G_i} \mid \hat{\theta}_{G_i})} = P(D_i \mid \hat{\theta}_{i|G_i}).

Proof. We can write the terms on the left side of the equation as

    P(D_{i,G_i} \mid \hat{\theta}_{i,G_i}) = \prod_{j,k} \left( \frac{N_{ijk}}{N} \right)^{N_{ijk}}, \quad \text{and} \quad P(D_{G_i} \mid \hat{\theta}_{G_i}) = \prod_{j} \left( \frac{N_{ij}}{N} \right)^{N_{ij}}.

Here, N_{ijk} denotes the number of times we observe X_i taking value k when its parents are in the jth configuration in our data matrix D. Also, N_{ij} = \sum_k N_{ijk} (and \sum_{j,k} N_{ijk} = N for all i). Therefore,

    \frac{P(D_{i,G_i} \mid \hat{\theta}_{i,G_i})}{P(D_{G_i} \mid \hat{\theta}_{G_i})} = \frac{\prod_{j,k} (N_{ijk}/N)^{N_{ijk}}}{\prod_{j} (N_{ij}/N)^{N_{ij}}} = \prod_{j} \prod_{k} \left( \frac{N_{ijk}}{N_{ij}} \right)^{N_{ijk}} = P(D_i \mid \hat{\theta}_{i|G_i}).
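A quick numerical sanity check of Lemma 1 on an invented toy column (our own code, not part of the paper's proof):

```python
from collections import Counter
from math import log, isclose

# Toy column X_i (r_i = 3) and its two binary parent columns; all values are invented.
xi      = [0, 1, 1, 0, 1, 1, 0, 2]
parents = [(0, 0), (0, 1), (0, 1), (1, 0), (1, 0), (1, 1), (1, 1), (1, 1)]
N = len(xi)

joint  = Counter(zip(parents, xi))     # counts N_ijk of (parent configuration j, value k)
parent = Counter(parents)              # counts N_ij of parent configuration j

# Left side of Lemma 1: log P(D_{i,G_i} | ML) - log P(D_{G_i} | ML),
# with both column sets treated as single collapsed categorical variables.
lhs = (sum(n * log(n / N) for n in joint.values())
       - sum(n * log(n / N) for n in parent.values()))

# Right side: log P(D_i | ML of the conditional distribution) = sum_{j,k} N_ijk * log(N_ijk / N_ij).
rhs = sum(n * log(n / parent[j]) for (j, _), n in joint.items())

print(lhs, rhs, isclose(lhs, rhs))
```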

Next, we consider the difference of regrets in (10), which corresponds to the penalty term of BIC. The following lemma states that these two are asymptotically equal:

Lemma 2. As N → ∞,

    \mathrm{reg}(N, q_i r_i) - \mathrm{reg}(N, q_i) = \frac{q_i(r_i - 1)}{2} \log N + O(1).

Proof. The regret for a single multinomial variable with m categories can be written asymptotically as

    \mathrm{reg}(N, m) = \frac{m - 1}{2} \log N + O(1).    (11)

For the more precise statement with the underlying assumptions (which are fulfilled in the multinomial case) and for the proof, we refer to [Rissanen, 1996, Grünwald, 2007]. Using this, we have

    \mathrm{reg}(N, q_i r_i) - \mathrm{reg}(N, q_i) = \frac{q_i r_i - 1}{2} \log N - \frac{q_i - 1}{2} \log N + O(1) = \frac{q_i r_i - 1 - q_i + 1}{2} \log N + O(1) = \frac{q_i(r_i - 1)}{2} \log N + O(1).

This concludes our proof since Lemmas 1 and 2 imply Theorem 2.
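The rate in Lemma 2 can also be eyeballed numerically. The sketch below (our own code) uses the approximation (7) for the regret and prints the ratio of the regret difference to (q_i(r_i - 1)/2) log N; because of the O(1) term the ratio approaches one only slowly as N grows:

```python
from math import log, sqrt

def regret(N, m):
    """Approximation (7) of reg(N, m)."""
    a = m / N
    Ca = 0.5 + 0.5 * sqrt(1 + 4 / a)
    return N * (log(a) + (a + 2) * log(Ca) - 1 / Ca) - 0.5 * log(Ca + 2 / a)

q_i, r_i = 2, 3                       # a child with r_i = 3 values and q_i = 2 parent configurations
for N in [10**2, 10**4, 10**6, 10**8]:
    diff = regret(N, q_i * r_i) - regret(N, q_i)
    bic_penalty = q_i * (r_i - 1) / 2 * log(N)
    print(N, round(diff, 2), round(bic_penalty, 2), round(diff / bic_penalty, 3))
```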

3.3 qNML Equals NML for Many Models

The fNML criterion can be seen as a computationally feasible approximation of the more desirable NML criterion. However, the fNML criterion equals the NML criterion only for the Bayesian network structure with no arcs. It can be shown that the qNML criterion equals the NML criterion for all the networks G whose connected components are tournaments (i.e., complete directed acyclic subgraphs of G). These networks include the empty network, the fully connected one, and many networks in between having different complexity. While the generating network is unlikely to be composed of tournament components, the result increases the plausibility that qNML is a reasonable approximation for NML in general (a claim that is naturally subject for further study).

Theorem 3. If G consists of C connected components (G^1, ..., G^C) with variable sets (V^1, ..., V^C), then log P_NML(D; G) = s^{qNML}(D; G) for all data sets D.

Proof. The full proof can be found in Appendix C in the Supplementary Material. The proof first shows that NML decomposes for these particular structures, so it is enough to show the equivalence for fully connected graphs. It further derives the number a(n) of different n-node networks whose connected components are tournaments, which turns out to be the formula for OEIS sequence A000262 (https://oeis.org/A000262). In general this sequence grows rapidly: 1, 1, 3, 13, 73, 501, 4051, 37633, 394353, 4596553, ...
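For reference, these counts can be generated with a simple two-term recurrence that reproduces the terms listed above (a small sketch of our own, with the recurrence a(n) = (2n-1)a(n-1) - (n-1)(n-2)a(n-2)):

```python
def a000262(n_max):
    """Terms a(0..n_max) of the sequence via a(n) = (2n-1)a(n-1) - (n-1)(n-2)a(n-2), a(0)=a(1)=1."""
    a = [1, 1]
    for n in range(2, n_max + 1):
        a.append((2 * n - 1) * a[n - 1] - (n - 1) * (n - 2) * a[n - 2])
    return a[:n_max + 1]

print(a000262(9))   # [1, 1, 3, 13, 73, 501, 4051, 37633, 394353, 4596553]
```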
3.4 qNML is Regular

Suzuki [Suzuki, 2017] defines regularity for a scoring function Q_N(X | Y) as follows:

Definition 1. Assume H_N(X | Y) ≤ H_N(X | Y'), where Y' ⊂ Y. We say that Q_N(· | ·) is regular if Q_N(X | Y') ≥ Q_N(X | Y).

In the definition, N denotes the sample size, X is one of the variables, Y denotes the proposed parent set for X, and H_N(· | ·) refers to the empirical conditional entropy. Suzuki [Suzuki, 2017] shows that BDeu violates this principle and demonstrates that this can cause the score to prefer more complex networks even though the data do not support this. Regular scores are also argued to be computationally more efficient when applied with branch-and-bound type algorithms for Bayesian network structure learning [Suzuki and Kawahara, 2017].

By analyzing the penalty term of the qNML scoring function, one can prove the following statement:

Theorem 4. The qNML score is regular.

Proof. The proof is given in Appendix B in the Supplementary Material.

As the fNML criterion differs from qNML only by how the penalty term is defined, we obtain the following result with little extra work:

Theorem 5. The fNML score is regular.

Proof. The proof is given in Appendix B in the Supplementary Material.

Suzuki [Suzuki, 2017] independently introduces a Bayesian Dirichlet quotient (BDq) score that can also be shown to be score equivalent and regular. However, like BDeu, this score features a single hyperparameter α, and our initial studies suggest that BDq is also very sensitive to this hyperparameter (see Appendix D in the Supplementary Material), an issue that was one of the main motivations to develop a parameter-free model selection criterion like qNML.

4 EXPERIMENTAL RESULTS

We empirically compare the capacity of qNML to that of BIC, BDeu (α = 1) and fNML in identifying the data generating structures and in producing models that are predictive and parsimonious. It seems that none of the criteria uniformly outperforms the others in all these desirable aspects of model selection criteria.

4.1 Finding Generating Structure

In our first experiment, we took five widely used benchmark Bayesian networks (from the Bayesian Network Repository: http://www.bnlearn.com/bnrepository/), sampled data from them, and tried to learn the generating structure with the different scoring functions using various sample sizes. We used the following networks: Asia (n = 8, 8 arcs), Cancer (n = 5, 4 arcs), Earthquake (n = 5, 4 arcs), Sachs (n = 11, 17 arcs) and Survey (n = 6, 6 arcs). These networks were picked in order to use the dynamic programming based exact structure learning [Silander and Myllymäki, 2006], which limited the number n of variables to less than 20. We measured the quality of the learned structures using structural Hamming distance (SHD) [Tsamardinos et al., 2006].
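For orientation, here is a deliberately simplified stand-in for SHD (our own code; the paper uses the CPDAG-based SHD of Tsamardinos et al., which additionally accounts for equivalence classes). It counts missing edges, extra edges, and wrongly oriented edges between two DAGs given as sets of directed arcs:

```python
def shd_simplified(edges_true, edges_learned):
    """Rough structural Hamming distance between two DAGs given as sets of directed edges:
    missing or extra adjacencies plus wrongly oriented edges (a reversed edge counts once)."""
    skel_true = {frozenset(e) for e in edges_true}
    skel_learned = {frozenset(e) for e in edges_learned}
    missing_or_extra = len(skel_true ^ skel_learned)
    reversed_edges = sum(1 for (a, b) in edges_learned
                         if (b, a) in edges_true and (a, b) not in edges_true)
    return missing_or_extra + reversed_edges

print(shd_simplified({("A", "B"), ("B", "C")}, {("A", "B"), ("C", "B"), ("A", "C")}))  # 2
```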

[Figure 2: Sample size versus SHD with data generated from real-world DAGs; one panel per network (Survey, Earthquake, Sachs, Cancer, Asia), with curves for BIC, BDeu, fNML and qNML over sample sizes from 10^1 to 10^4.]

Figure 2 shows SHDs for all the scoring criteria for each network. Sample sizes range from 10 to 10000 and the shown results are averages computed from 1000 repetitions. None of the scores dominates in all settings considered. BIC fares well when the sample size is small as it tends to produce a nearly empty graph, which is a good answer in terms of SHD when the generating networks are relatively sparse. qNML obtains strong results in the Earthquake and Asia networks, being the best or the second best with all the sample sizes considered.

Figure 3 summarizes the SHD results in all networks by showing the average rank for each score. The ranking was done by giving the score with the lowest SHD rank 1 and the worst one rank 4. In case of ties, the methods with the same SHD were given the same rank. The shown results are averages computed from 5000 values (5 networks, 1000 repetitions). From this, we can see that qNML never has the worst average ranking, and it has the best ranking with sample sizes greater than 300. This suggests that qNML is overall a safe choice in structure learning, especially with moderate and large sample sizes.

[Figure 3: Average ranks for the scoring functions in structure learning experiments; average rank (1 = best, 4 = worst) as a function of sample size for BIC, BDeu, fNML and qNML.]

4.2 Prediction and Parsimony

To empirically compare the model selection criteria, we took 20 UCI data sets [Lichman, 2013] and ran train and test experiments for all of them. To better compare the performance over different sample sizes, we picked different fractions (10%, 20%, ..., 90%) of the data sets as training data and the rest as the test data. This was done for 1000 different permutations of each data set. The training was conducted using the dynamic programming based exact structure learning algorithm.

When predicting P(d_test | D_train, S, θ) with structures S learned by the BDeu score, we used the Bayesian predictive parameter values (BPP) θ_ijk ∝ N_ijk + 1/(r_i q_i). In the spirit of keeping the scores hyperparameter-free, for structures learned by the other model selection criteria, we used the sequential predictive NML (sNML) parametrization θ_ijk ∝ e(N_ijk)(N_ijk + 1), where e(n) = ((n+1)/n)^n, as suggested in [Rissanen and Roos, 2007].
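A small sketch of the two predictive parametrizations (our own code; the proportionality constants are normalized over k, and e(0) is taken to be 1, its limiting value, since the formula is not defined at n = 0):

```python
def bpp_params(counts, r_i, q_i):
    """BPP: theta_ijk proportional to N_ijk + 1/(r_i * q_i), normalized over k."""
    weights = [n + 1.0 / (r_i * q_i) for n in counts]
    total = sum(weights)
    return [w / total for w in weights]

def snml_params(counts):
    """sNML: theta_ijk proportional to e(N_ijk) * (N_ijk + 1) with e(n) = ((n+1)/n)^n,
    using the convention e(0) = 1, normalized over k."""
    def e(n):
        return 1.0 if n == 0 else ((n + 1) / n) ** n
    weights = [e(n) * (n + 1) for n in counts]
    total = sum(weights)
    return [w / total for w in weights]

# Toy counts N_ijk for one parent configuration j of a variable with r_i = 3 values.
counts = [5, 1, 0]
print(bpp_params(counts, r_i=3, q_i=2))
print(snml_params(counts))
```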

For each train/test sample, we ranked the predictive performance of the models learned by the four different scores (rank 1 being the best and 4 the worst). Table 2 features the average rank for different data sets, the average being taken over 1000 different train/test samples for each of the 9 sample sizes.

Table 2: Average predictive performance rank over different sample sizes for different model selection criteria in 20 different data sets.

    Data      N        BDeu (BPP)   BIC (sNML)   fNML (sNML)   qNML (sNML)
    PostOpe   90           2.79         1.20          3.06          2.94
    Iris      150          2.82         2.37          2.27          2.54
    Wine      178          3.23         1.88          2.67          2.22
    Glass     214          3.61         3.09          1.42          1.88
    Thyroid   215          2.55         3.21          1.80          2.44
    HeartSt   270          3.12         1.39          3.12          2.37
    BreastC   286          3.09         1.41          2.97          2.53
    HeartHu   294          3.18         1.66          2.90          2.27
    HeartCl   303          3.46         1.38          2.99          2.17
    Ecoli     336          3.20         3.53          1.24          2.04
    Liver     345          3.17         2.39          2.69          1.75
    Balance   625          3.35         1.91          1.59          3.16
    BcWisco   699          3.06         2.03          2.89          2.02
    Diabete   768          2.91         2.70          2.68          1.71
    TicTacT   958          3.44         2.71          1.31          2.53
    Yeast     1484         2.60         3.76          1.55          2.10
    Abalone   4177         2.60         3.64          1.04          2.72
    PageBlo   5473         2.24         3.61          1.31          2.83
    Adult     32561        3.23         3.77          1.00          2.00
    Shuttle   58000        1.44         3.78          1.56          3.22

Table 3: Average number of parameters in models for different scores in 20 different data sets.

    Data      N        BDeu     BIC    fNML    qNML
    Iris      15         37      23      33      29
    PostOpe   18       1217      19     397     146
    Ecoli     34        182      31     162      77
    Liver     35         45      15      61      24
    Wine      36      16521      70     807     205
    Glass     44       1677      48     506      97
    Thyroid   44         40      23      66      28
    HeartSt   54      16861      44    1110     256
    BreastC   58      25797      49    3767     844
    HeartHu   60       1634      43     792      90
    HeartCl   62      34381      47    1433     404
    BcWisco   70       4630      42     603      89
    Diabete   77         39      22     216      34
    TicTacT   96      13701      25    1969     767
    Balance   126        20      24      49     611
    Yeast     149        71      31     265      75
    Abalone   418        91      46     150      63
    PageBlo   548       703      45     380      56
    Shuttle   5800      535      99     717     130
    Adult     6513      699     479    1555     945

BIC's bias for simplicity makes it often achieve the best rank with small sample sizes, but for the same reason it performs worst for the larger sample sizes, while fNML seems to be good for large sample sizes. The striking feature about qNML is its robustness. It is usually between BIC and fNML for all the sample sizes, making it a "safe choice". This can be quantified if we further average the columns of Table 2, yielding the average ranks 2.95, 2.57, 2.10, and 2.37 (for BDeu, BIC, fNML, and qNML), with standard deviations 0.49, 0.90, 0.76, and 0.43. While fNML achieves on average the best rank, the runner-up qNML has the lowest standard deviation.

Figure 1 shows how fNML still sometimes behaves strangely in terms of model complexity as measured by the number of parameters in the model. qNML, instead, appears to yield more parsimonious models. To study the concern of fNML producing too complex models for small sample sizes, we studied the number of parameters in models produced by the different scores when using 10% of each data set for structure learning. Looking at the number of parameters for the same 20 data sets again features BIC's preference for simple models (Table 3). qNML usually (19/20) yields more parsimonious models than fNML, which selects the most complex model for 7 out of 20 data sets.

The graphs for different sample sizes for both predictive accuracy and the number of parameters can be found in Appendix E in the Supplementary Material.

5 CONCLUSION

We have presented qNML, a new model selection criterion for learning structures of Bayesian networks. While being competitive in predictive terms, it often yields significantly simpler models than other common model selection criteria, with the exception of BIC, which has a very strong bias for simplicity. The computational cost of qNML equals the cost of the current state-of-the-art criteria. The criterion also gives equal scores for models that encode the same independence hypotheses about the joint probability distribution. qNML also coincides with the NML criterion for many models. In our experiments, the qNML criterion appears as a safe choice for a model selection criterion that balances parsimony, predictive capability and the ability to quickly converge to the generating model.

Acknowledgements

J.L., E.J. and T.R. were supported by the Academy of Finland (COIN CoE and Project TensorML). J.L. was supported by the DoCS doctoral programme of the University of Helsinki.

References

[Akaike, 1973] Akaike, H. (1973). Information theory and an extension of the maximum likelihood principle. In Petrox, B. and Caski, F., editors, Proceedings of the Second International Symposium on Information Theory, pages 267–281, Budapest. Akademiai Kiado.

[Buntine, 1991] Buntine, W. (1991). Theory refinement on Bayesian networks. In D'Ambrosio, B., Smets, P., and Bonissone, P., editors, Proceedings of the Seventh Conference on Uncertainty in Artificial Intelligence, pages 52–60. Morgan Kaufmann Publishers.

[Chickering, 1995] Chickering, D. M. (1995). A transformational characterization of equivalent Bayesian network structures. In UAI '95: Proceedings of the Eleventh Annual Conference on Uncertainty in Artificial Intelligence, pages 87–98. Morgan Kaufmann.

[Eggeling et al., 2014] Eggeling, R., Roos, T., Myllymäki, P., Grosse, I., et al. (2014). Robust learning of inhomogeneous PMMs. In Kaski, S. and Corander, J., editors, Proceedings of the Seventeenth International Conference on Artificial Intelligence and Statistics, pages 229–237. PMLR.

[Grünwald, 2007] Grünwald, P. (2007). The Minimum Description Length Principle. MIT Press.

[Haughton, 1988] Haughton, D. (1988). On the choice of a model to fit data from an exponential family. Annals of Statistics, 16(1):342–355.

[Heckerman, 1995] Heckerman, D. (1995). A Bayesian approach to learning causal networks. In Besnard, P. and Hanks, S., editors, Uncertainty in Artificial Intelligence: Proceedings of the Eleventh Conference, pages 285–295, Montreal, Canada.

[Heckerman et al., 1995] Heckerman, D., Geiger, D., and Chickering, D. (1995). Learning Bayesian networks: The combination of knowledge and statistical data. Machine Learning, 20(3):197–243.

[Ide et al., 2014] Ide, J. S., Zhang, S., and Li, C. S. (2014). Bayesian network models in brain functional connectivity analysis. International Journal of Approximate Reasoning, 56(1 Pt 1).

[Kontkanen et al., 2003] Kontkanen, P., Buntine, W., Myllymäki, P., Rissanen, J., and Tirri, H. (2003). Efficient computation of stochastic complexity. In Bishop, C. and Frey, B., editors, Proceedings of the Ninth International Conference on Artificial Intelligence and Statistics, pages 233–238. Society for Artificial Intelligence and Statistics.

[Lichman, 2013] Lichman, M. (2013). UCI machine learning repository.

[Liu et al., 2012] Liu, Z., Malone, B., and Yuan, C. (2012). Empirical evaluation of scoring functions for Bayesian network model selection. BMC Bioinformatics, 13(15):S14.

[Määttä et al., 2016] Määttä, J., Schmidt, D. F., and Roos, T. (2016). Subset selection in linear regression using sequentially normalized least squares: Asymptotic theory. Scandinavian Journal of Statistics, 43(2):382–395.

[Pearl, 1988] Pearl, J. (1988). Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference. Morgan Kaufmann Publishers, San Mateo, CA.

[Rissanen, 1978] Rissanen, J. (1978). Modeling by shortest data description. Automatica, 14:445–471.

[Rissanen, 1996] Rissanen, J. (1996). Fisher information and stochastic complexity. IEEE Transactions on Information Theory, 42(1):40–47.

[Rissanen and Roos, 2007] Rissanen, J. and Roos, T. (2007). Conditional NML models. In Proceedings of the Information Theory and Applications Workshop (ITA-07), San Diego, CA.

[Sachs et al., 2002] Sachs, K., Gifford, D., Jaakkola, T., Sorger, P., and Lauffenburger, D. A. (2002). Bayesian network approach to cell signaling pathway modeling. Science's STKE, 2002(148):38–38.

[Savickas and Vasilecas, 2014] Savickas, T. and Vasilecas, O. (2014). Bayesian belief network application in process mining. In Proceedings of the 15th International Conference on Computer Systems and Technologies, CompSysTech '14, pages 226–233, New York, NY, USA. ACM.

[Schwarz, 1978] Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6:461–464.

[Shtarkov, 1987] Shtarkov, Y. (1987). Universal sequential coding of single messages. Problems of Information Transmission, 23:3–17.

[Silander et al., 2007] Silander, T., Kontkanen, P., and Myllymäki, P. (2007). On sensitivity of the MAP Bayesian network structure to the equivalent sample size parameter. In Parr, R. and van der Gaag, L., editors, Proceedings of the 23rd Conference on Uncertainty in Artificial Intelligence (UAI-07), pages 360–367. AUAI Press.

[Silander and Myllymäki, 2006] Silander, T. and Myllymäki, P. (2006). A simple approach for finding the globally optimal Bayesian network structure. In Dechter, R. and Richardson, T., editors, Proceedings of the 22nd Conference on Uncertainty in Artificial Intelligence (UAI-06), pages 445–452. AUAI Press.

[Silander et al., 2008] Silander, T., Roos, T., Kontkanen, P., and Myllymäki, P. (2008). Factorized normalized maximum likelihood criterion for learning Bayesian network structures. In Proceedings of the 4th European Workshop on Probabilistic Graphical Models (PGM-08), pages 257–264, Hirtshals, Denmark.

[Silander et al., 2010] Silander, T., Roos, T., and Myllymäki, P. (2010). Learning locally minimax optimal Bayesian networks. International Journal of Approximate Reasoning, 51(5):544–557.

[Steck, 2008] Steck, H. (2008). Learning the Bayesian network structure: Dirichlet prior vs data. In McAllester, D. A. and Myllymäki, P., editors, Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI-08), pages 511–518. AUAI Press.

[Suzuki, 2017] Suzuki, J. (2017). A theoretical analysis of the BDeu scores in Bayesian network structure learning. Behaviormetrika, 44(1):97–116.

[Suzuki and Kawahara, 2017] Suzuki, J. and Kawahara, J. (2017). Branch and bound for regular Bayesian network structure learning. In Proceedings of the 33rd Conference on Uncertainty in Artificial Intelligence (UAI 2017), Sydney, Australia.

[Szpankowski and Weinberger, 2012] Szpankowski, W. and Weinberger, M. J. (2012). Minimax pointwise redundancy for memoryless models over large alphabets. IEEE Transactions on Information Theory, 58(7):4094–4104.

[Tsamardinos et al., 2006] Tsamardinos, I., Brown, L. E., and Aliferis, C. F. (2006). The max-min hill-climbing Bayesian network structure learning algorithm. Machine Learning, 65(1):31–78.

[Verma and Pearl, 1991] Verma, T. and Pearl, J. (1991). Equivalence and synthesis of causal models. In Proceedings of the Sixth Annual Conference on Uncertainty in Artificial Intelligence (UAI-90), pages 255–270, New York, NY, USA. Elsevier Science Inc.