Learning Bayesian Networks Through Birkhoff Polytope: a Relaxation Method

1 Learning Bayesian Networks through Birkhoff Polytope: A Relaxation Method Aramayis Dallakyan, Mohsen Pourahmadi Abstract—We establish a novel framework for learning a directed acyclic graph (DAG) when data are generated from a Gaussian, linear structural equation model. It consists of two parts: (1) introduce a permutation matrix as a new parameter within a regularized Gaussian log-likelihood to represent variable ordering; and (2) given the ordering, estimate the DAG structure through sparse Cholesky factor of the inverse covariance matrix. For permutation matrix estimation, we propose a relaxation technique that avoids the NP-hard combinatorial problem of order estimation. Given an ordering, a sparse Cholesky factor is estimated using a cyclic coordinatewise descent algorithm which decouples row-wise. Our framework recovers DAGs without the need for an expensive verification of the acyclicity constraint or enumeration of possible parent sets. We establish numerical convergence of the algorithm, and consistency of the Cholesky factor estimator when the order of variables is known. Through several simulated and macro-economic datasets, we study the scope and performance of the proposed methodology. Index Terms—Bayesian Networks, sparse Cholesky factorization, Directed Acyclic Graphs, Permutation relaxation F 1 INTRODUCTION methods [10], [22]–[24] where the topological ordering is AYESIAN considered as a parameter [10]. The order-based search Networks (BNs) are a popular class of graph- O(p log p) ical models whose structure is represented by a DAG has two main advantages: the ordering space (2 ) is B O(p2) G. BNs have been used in many applications such as significantly smaller than the DAG search space (2 ), economics, finance, biology, etc [1]–[5]. In recent years the and the existence of ordering guarantees satisfaction of the following two approaches have been evolved to learn the acyclicity constraint. structure of the underlying DAG from data: Independence- The recent Annealing on Regularized Cholesky Score based (also called constraint-based) methods [6], [7] and (ARCS) algorithm in [24] is based on representing an order- score-based methods [8]–[11]. Here, structure learning refers ing by the corresponding permutation matrix P , and then to recovering DAG from observational data. given the order, encoding the weighted adjacency matrix B Independence-based methods, such as the inductive cau- into the Cholesky factor L of the inverse covariance matrix. sation (IC) [7] and PC (Peter-Clark) [6] algorithm, utilize ARCS optimizes a regularized likelihood score function to conditional independence tests to detect the existence of recover sparse DAG structure and utilizes simulated an- edges between each pair of variables. The method assumes nealing (SA) to search over the permutation matrix space. that the distribution is Markovian and faithful with respect In SA, using a pre-specified constant m and a temperature to the underlying DAG, where P is faithful to the DAG G schedule fT (i); i = 0;:::;Ng, in the ith iteration a new if all conditional independencies in P are entailed in G and permutation matrix P ∗ is proposed by flipping a fixed- Markovian if the factorization property (1) is satisfied. length m random interval in the current permutation P^, and In contrast, score-based methods measure the goodness checking whether to stay at the current P^ or move to the of fit of different graphs over data by optimizing a score proposed P ∗ with some probability. arXiv:2107.01658v1 [stat.ML] 4 Jul 2021 function with respect to the unknown (weighted) adjacency Motivated by the ARCS two-step framework, we pro- matrix B with a combinatorial constraint that the graph is pose an order-based method for learning Gaussian DAGs by DAG. Then a search procedure is used to find the best graph. optimizing a non-convex regularized likelihood score func- Commonly used search procedures include hill-climbing [8], tion with the following distinct features and advantages: [12], forward-backward search [9], dynamic, and integer programming [13]–[17]. Recently, [18], [19] proposed a fully First, we use a relaxation technique instead of the expen- continuous optimization for structure learning by introduc- sive search for a permutation matrix P in the non-convex ing a novel characterization of acyclicity constraint. space of permutation matrices. More precisely, we project Generally, the DAG search space is intractable for a large P onto the Birkhoff polytope (the convex space of doubly number of nodes p and the task of finding a DAG is NP- stochastic matrices) and then find the “closest” permutation hard [9]. Consequently, approximate methods have been matrix to the optimal doubly stochastic matrix (See Fig- proposed with additional assumptions such as bounded ure 2). Second, given P , we resort to the cyclic coordinate- maximum indegree of the node [20] or tree-like structures wise algorithm to recover the DAG structure entailed in the [21]. Alternatively, the ordering space (or the space of Cholesky factor L. We show that the optimization reduces topological ordering) has been exploited for score-based to p decoupled penalized regressions where each iteration of the cyclic coordinatewise algorithm has a closed form Department of Statistics, Texas A&M University, College Station, TX, 77843 solution. Third, we show consistency of our Cholesky factor 2 estimator for the non-convex score function when the true permutation matrix is known. To the best of our knowledge, consistency results for the sparse Cholesky factor estimator were established only for convex problems [25], [26]. The paper is organized as follows: Section 2 introduces background on Gaussian BNs and structural equation models (SEMs). In Section 3, we derive and discuss the form of the score function. In Section 4, we introduce our Relaxed Regularized Cholesky Factor (RRCF) framework. The anal- yses of the simulated and real macro-economic datasets are contained in Section 5. For the real data analysis, we apply RRCF to solve the price puzzle, a classic problem in the economics literature. Section 6 provides statistical consistency of our estimator, and we conclude with a discussion Fig. 1: Illustration of DAG G, corresponding coefficient main Section 7. trix B, permutation matrix P , and permuted strictly lower triangular matrix Bπ. 2 BAYESIAN NETWORKS t We start by introducing the following graphical concepts. If of B to a strictly lower triangular matrix Bπ = PπBPπ the graph G contains a directed edge from the node k ! j, by permuting rows and columns of B, respectively [27] G then k is a parent of its child j. We write Πj for the set (see Figure 1 for the illustrative example). Therefore, the of all parents of a node j. If there exist a directed path stringent acyclicity constraint on B transforms into the k !···! j, then k is an ancestor of its descendant j.A constraint that Bπ is a strictly lower triangular matrix, then Bayesian Network is a directed acyclic graph G whose nodes the linear SEM can be rewritten as represent random variables X1;:::;Xp. Then G encodes a set of conditional independencies and conditional probabil- PπX = BπPπX + Pπ"; (4) ity distributions for each variable. The DAG G = (V; E) is t using the fact that PπPπ = I. From (4), the inverse covari- characterized by the node set V = f1; : : : ; pg and the edge G ance matrix can be expressed as set E = f(i; j): i 2 Πj g ⊂ V × V . It is well-known that for −1 t −1 a BN, the joint distribution factorizes as: Σπ = (I − Bπ) Ωπ (I − Bπ); (5) p Ω = P ΩP t L = Y where π π π. Using (4) and (5) and defining π P (X ;:::;X ) = P (X jΠG) −1=2 1 p j j (1) Ωπ (I −Bπ), the relationship between the Cholesky factor j=1 −1 t Lπ of the inverse covariance matrix Σπ = LπLπ and the matrix Bπ is 2.1 Gaussian BN and Structural Equation Models p (Lπ)ij = −(Bπ)ij= !j; and It is known that a Gaussian BN can be equivalently repre- (6) (L ) = 0 () (B ) = 0 for every i ≥ j sented by the linear SEM [7]: π ij π ij Hence, L preserves the DAG structure of B ; i.e., non-zero X π π Xj = βjkXk + "j; j = 1; : : : ; p; (2) elements in Lπ correspond to directed edges in DAG G. G k2Πj 2 HE CORE UNCTION where "j ∼ N(0;!j ) are mutually independent and inde- 3 T S F G pendent of fXk : k 2 Πj g. Denoting B = (βjk) with zeros In this section, given data from the Gaussian BN (or SEM), along the diagonal, the vector representation of (2) is we derive the form of the score function used to recover the underlying DAG structure. A natural choice for such X = BX + "; (3) function is the log-likelihood function, which will be used t t where " := ("1;:::;"p) and X := (X1;:::;Xp) . Thus, one for the estimation of the permutation and Cholesky fac- can characterize the linear SEM X ∼ (B; Ω) by the weighted tor matrices. We assume that each row of data matrix n×p adjacency matrix B and the noise variance matrix Ω = X = (X1;:::;Xp) 2 R is an i.i.d observation from (2). 2 2 diag(!1;:::;!p). From (3), the inverse covariance matrix of Using reformulation (4), X ∼ N (0; Σ) is Σ−1 = (I − B)tΩ−1(I − B), and the edge p XP t = XP t Bt + EP t ; (7) set of the underlying DAG is equal to the support of the π π π π weighted adjacency matrix B; i.e., E = f(k; j): βjk 6= 0g, where each row of E is an i.i.d Np(0; Ω) vector. Thus, each t which defines the structure of DAG G.

Load more