IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS

cuPC: CUDA-based Parallel PC Algorithm for Causal Structure Learning on GPU
Behrooz Zarebavani, Foad Jafarinejad, Matin Hashemi, and Saber Salehkaleybar

This article is published. Please cite as: B. Zarebavani, F. Jafarinejad, M. Hashemi, S. Salehkaleybar, "cuPC: CUDA-based Parallel PC Algorithm for Causal Structure Learning on GPU," IEEE Transactions on Parallel and Distributed Systems (TPDS), 2019. doi: 10.1109/TPDS.2019.2939126
Abstract—The main goal in many fields of the empirical sciences is to discover causal relationships among a set of variables from observational data. The PC algorithm is one of the promising solutions for learning the underlying causal structure by performing a number of conditional independence tests. In this paper, we propose a novel GPU-based parallel algorithm, called cuPC, to execute an order-independent version of PC. The proposed solution has two variants, cuPC-E and cuPC-S, which parallelize PC in two different ways for multivariate normal distributions. Experimental results show the scalability of the proposed algorithms with respect to the number of variables, the number of samples, and different graph densities. For instance, in one of the most challenging datasets, the runtime is reduced from more than 11 hours to about 4 seconds. On average, cuPC-E and cuPC-S achieve 500X and 1300X speedup, respectively, compared to a serial implementation on CPU. The source code of cuPC is available online [1].
Index Terms—Bayesian Networks, Causal Discovery, CUDA, GPU, Machine Learning, Parallel Processing, PC Algorithm.
1 INTRODUCTION

Learning causal structures is one of the main problems in empirical sciences. For instance, we need to understand the impact of a medical treatment on a disease or recover causal relations between genes in gene regulatory networks (GRN) [2]. By discovering such causal relations, one will be able to predict the impact of different actions. Causal relations can be inferred by controlled randomized experiments. However, in many cases, it is not possible to perform the required experiments due to technical or ethical reasons. In such cases, causal relations need to be learned merely from observational data [3], [4].

The causal Bayesian network is one of the models which has been widely considered to explain the data-generating mechanism. In this model, causal relations among variables are represented by a directed acyclic graph (DAG) where there is a directed edge from variable Vi to variable Vj if Vi is a direct cause of Vj. The task of causal structure learning is to learn all DAGs that are compatible with the observed data. Under some assumptions [4], the underlying true causal structure is in the set of recovered DAGs if the number of observed data samples goes to infinity. Two common approaches for learning causal structures are the score-based and constraint-based approaches. In the score-based approach, in order to find a set of DAGs that best explains the dependency relations among the variables, a score function is evaluated, which might become an NP-hard problem [5]. In the constraint-based approach, such DAGs are found by performing a number of conditional independence (CI) tests.

Spirtes and Glymour [4] proposed a promising solution, called the PC algorithm. For ground-truth graphs with bounded degrees, the PC algorithm does not require performing high-order conditional independence tests, and thus runs in polynomial time. The PC algorithm has become a common tool for causal explorations and is available in different graphical model learning packages such as pcalg [6], bnlearn [7], and TETRAD [8]. Moreover, it has been widely applied in different applications such as learning the causal structure of GRNs from gene expression data [9], [10]. Furthermore, a number of causal structure learning algorithms, for instance, FCI and its variants such as RFCI [4], [11], and the CCD algorithm [12], use the PC algorithm as a subroutine.

The PC algorithm starts from a complete undirected graph and removes the edges in consecutive levels based on carefully selected conditional independence tests. However, performing this number of tests might take a few days on a single machine for some gene expression data such as the DREAM5-Insilico dataset [13]. Furthermore, the order of performing conditional independence tests may affect the final result. Parallel implementations of the PC algorithm on multi-core CPUs have been proposed in [14], [15]. In [16], Colombo and Maathuis proposed a variant of the PC algorithm called PC-stable which is order-independent and produces fewer errors compared with the original PC algorithm. The key property of PC-stable is that removing an edge in a level has no effect on performing the conditional independence tests of other edges in that level. This order-independent property makes PC-stable suitable for execution on multi-core machines. In [17], Le et al. proposed a parallel implementation of the PC-stable algorithm on multi-core CPUs, called Parallel-PC, which reduces the runtime by an order of magnitude. For instance, it takes a couple of hours to process the DREAM5-Insilico dataset.

arXiv:1812.08491v4 [cs.DC] 23 Sep 2019

The authors are with the Learning and Intelligent Systems Laboratory, Department of Electrical Engineering, Sharif University of Technology, Tehran, Iran. Webpage: http://lis.ee.sharif.edu/ E-mails: [email protected], [email protected], [email protected] (corresponding author), [email protected].
In the case of using GPU hardware, there was an attempt at parallelization of the PC-stable algorithm in [18]. However, only a small part (only level zero and level one) of the PC-stable algorithm is parallelized in this method, and thus, it cannot be used as a complete solution for many datasets which require more than two levels. In fact, their approach cannot be generalized to level two and beyond.

In this paper, we propose a GPU-based parallel algorithm, called "cuPC", for learning causal structures based on PC-stable. We assume that there are no missing observations for any variable, and that the data has a multivariate normal distribution. In order to execute PC-stable, one needs to perform conditional independence tests to evaluate whether two variables Vi and Vj are independent given another set of variables S. The proposed algorithm has two variants, called "cuPC-E" and "cuPC-S", which employ the following ideas.

I) cuPC-E employs two degrees of parallelism at the same time. The first is performing tests for multiple edges in parallel, and the second is parallelizing the tests which are performed for a given edge. Although abundant parallelism is available, parallelizing all such tests does not yield the highest performance because it incurs different overheads and also results in many unnecessary tests. Instead, cuPC-E judiciously strikes a balance between the two degrees of parallelism in order to efficiently utilize the parallel computing capabilities of the GPU and avoid launching unnecessary tests at the same time. In addition, cuPC-E employs two configuration parameters which can be adjusted to tune the performance and achieve high speedup on both sparse and dense graphs.

II) A conditional set S might be common in the tests of many pairs of variables. cuPC-S takes advantage of this property and reuses the results of the computations in one such test in the others. This sharing can be performed in different ways. For instance, sharing all redundant computations in processing the entire graph might at first seem more beneficial, but it has non-justifiable overheads. Hence, cuPC-S employs a carefully-designed local sharing strategy in order to avoid different overheads and achieve significant speedup.

III) The cuPC-E and cuPC-S parallel algorithms avoid storing the indices of the variables in set S. Instead, a combination function is employed to compute the indices on-the-fly and also in parallel.

IV) The causal structure is represented by an adjacency matrix which is compacted before starting the computations in every level. The compacted format is judiciously selected to assign and execute parallel threads more efficiently, and also improves cache performance.

V) GPU shared memory is used in order to improve performance.

VI) Where applicable, threads are terminated early in order to avoid performing unnecessary computations. For instance, edge removals are monitored in parallel, and when an edge is removed in another thread or another block, the rest of the tests on that edge are skipped.

Experiments on multiple datasets show the scalability of the proposed parallel algorithms with respect to the number of variables, the number of samples, and different graph densities. For instance, in one of the most challenging datasets, cuPC-S can reduce the runtime of PC-stable from more than 11 hours to about 4 seconds. On average, cuPC-E and cuPC-S achieve about 500X and 1300X speedup, respectively, compared to a serial implementation on CPU.

The rest of this paper is organized as follows. In Section 2, we review some preliminaries on causal Bayesian networks and give a description of PC-stable. In Section 3, we present the two variants of the cuPC algorithm, cuPC-E and cuPC-S. Furthermore, we elaborate on the details of our contributions in Section 4. We conduct experiments to evaluate the performance and scalability of the proposed solution in Section 5 and conclude in Section 6.

2 PRELIMINARIES

2.1 Bayesian Networks

Consider a set of random variables V = {V1, V2, ..., Vn}. Given X, Y, Z ⊆ V, a conditional independence (CI) assertion of the form X ⊥⊥ Y | Z means X and Y are independent given Z. A CI test of the form I(X, Y | Z) is a test procedure based on observed data samples from X, Y, and Z which determines whether the corresponding CI assertion X ⊥⊥ Y | Z holds or not. Section 4.3 describes how to perform CI tests from observed data samples.

A graphical model G is a graph which encodes a joint distribution P over the random variables in V. The reason behind the development of graphical models is that the explicit representation of the joint distribution becomes infeasible as the number of variables grows. Furthermore, under some assumptions on the data-generating model, one can interpret causal relations among the variables from these graphs [19].

Bayesian Networks (BN) are a class of graphical models that represent a factorization of P over V by a directed acyclic graph (DAG) G = (V, E) as

P(V1, V2, ..., Vn) = ∏_{i=1}^{n} P(Vi | par(Vi)),   (1)

where E is the set of edges, and par(Vi) denotes the parents of Vi in G. Moreover, the graph G encodes conditional independence between the random variables in V by some notion of separation in graphs.

A Causal Bayesian Network (CBN) is a BN where each directed edge represents a cause-effect relationship from the parent to its child. For the exact definition of CBN, please refer to [20], Section 1.3. A CBN satisfies the causal Markov condition, i.e., given par(Vi), variable Vi is independent of any variable Vj for which there is no directed path from Vi to Vj. Let I(P) be the set of all CI assertions that hold in P. Under the causal Markov condition and faithfulness assumptions [4], all CI assertions in I(P) are encoded in the true causal graph G [21].

2.2 CPDAG

In a directed graph G, we say that three variables Vi, Vk, Vj ∈ V form a v-structure at Vk if variables Vi and Vj have an outgoing edge to variable Vk while they are not connected by any edge in G. This is denoted by Vi → Vk ← Vj. The skeleton of a directed graph G is an undirected graph that contains the edges of G without considering their orientations.

For a given joint distribution P, there might be different DAGs that can represent I(P). The set of all such DAGs is called a Markov equivalence class [22]. It can be shown that two DAGs are in the same Markov equivalence class if they have the same skeleton and the same set of v-structures [23]. A Markov equivalence class can be represented uniquely by a mixed graph called a completed partial DAG (CPDAG). In particular, there is a directed edge in the CPDAG from Vi to Vj if this edge exists with the same direction in all DAGs in the Markov equivalence class. There is an undirected edge between Vi and Vj in the CPDAG if there exist two DAGs in the Markov equivalence class which have an edge between Vi and Vj but with different orientations.

2.3 Causal Structure Learning

Causal structure learning, our focus in this paper, is the problem of finding a CPDAG which best describes the dependency relations in given data that is sampled from the random variables in V. In the literature, two main approaches have been proposed for causal structure learning [24]: the constraint-based approach and the score-based approach.

In the constraint-based approach, CI tests are utilized to recover the CPDAG. Examples include PC [4], Rank PC [25], PC-stable [16], IC [26], and FCI [4]. In the score-based approach, a score function indicates how well each DAG explains the dependency relations in the data. Then, a CPDAG with the highest score is obtained by searching over Markov equivalence classes. Examples include the Chow-Liu [27] and GES [28] algorithms. There are other methods such as LiNGAM [29], [30], and BACKSHIFT [31] which do not belong to either of the above two categories because their underlying assumptions are more restricted or their settings are different.

Choosing between the types of algorithms depends on the characteristics of the data [32]. For instance, Scutari et al. [33] concluded that constraint-based algorithms are more accurate than score-based algorithms for small sample sizes and that they are as accurate as hybrid algorithms. The PC algorithm, as one of the main constraint-based algorithms, has become a common tool for causal explorations and is available in different graphical model learning packages [6], [7], [8]. In addition, a number of causal structure learning algorithms utilize the PC algorithm as a subroutine [4], [11], [12].

2.4 PC-stable Algorithm

In the constraint-based approach, a naive solution to check whether there is an edge between two variables Vi and Vj in the CPDAG is to perform all CI tests of the form I(Vi, Vj | S) where S ⊆ V \ {Vi, Vj}. This solution is computationally infeasible for a large number of variables due to the exponentially growing number of CI tests.

Unlike the naive solution, the PC algorithm is computationally efficient for sparse graphs with up to thousands of variables and is commonly used in high-dimensional settings [34]. Here we describe the PC-stable algorithm, which is a variation of PC with fewer estimation errors [16].

PC-stable consists of two main steps: In the first step, the skeleton is determined by performing a number of carefully selected CI tests. In the second step, the set of v-structures is extracted and as many of the undirected edges as possible are oriented by applying a set of rules called Meek rules [35]. The second step is fairly fast. The first step is computationally intensive [15] and forms our focus in this paper. For instance, in ground-truth graphs with a bound ∆ on the maximum degree, the time complexity of the PC-stable algorithm is in the order of O(n^∆). Sections 3 and 4 present our proposed solution for the acceleration of this step on GPU. The details of the first step are described in the following.

Algorithm 1 The first step in PC-stable algorithm.
Input: V
Output: G, SepSet
1: G = fully connected graph
2: SepSet = ∅
3: ℓ = 0
4: repeat
5:   Copy G into G′
6:   for any edge (Vi, Vj) in G do
7:     repeat
8:       Choose a new S ⊆ adj(Vi, G′) \ {Vj} with |S| = ℓ
9:       Perform I(Vi, Vj | S)
10:      if Vi ⊥⊥ Vj | S then
11:        Remove (Vi, Vj) from G
12:        Store S in SepSet
13:      end if
14:    until (Vi, Vj) is removed or all sets S are considered
15:  end for
16:  ℓ = ℓ + 1
17: until ( max degree − 1 ≥ ℓ )

See Algorithm 1. First, G is initiated with a fully connected undirected graph over the set V (line 1). Next, the extra edges are removed from G by performing a number of CI tests. The tests are performed by levels. In each level ℓ, first a copy of G is stored in G′ (line 5). Next, for every edge (Vi, Vj) in graph G, a CI test I(Vi, Vj | S) is performed for any S ⊆ adj(Vi, G′) \ {Vj} such that |S| = ℓ (lines 6−9), where adj(Vi, G′) denotes the neighbors of Vi in G′ (see Section 4.3 for the details of performing a CI test). If there exists a set S where Vi is independent of Vj given S (line 10), edge (Vi, Vj) is removed from G, and S is stored in SepSet (lines 11−12). Once all the edges are considered, ℓ is incremented (line 16) and the above procedure is repeated. The algorithm continues as long as the maximum degree of the graph is large enough (line 17). The second step in PC-stable is to use SepSet to find the v-structures and orient the edges of graph G.

Fig. 1 illustrates the execution of the first step on a small graph. In level ℓ = 0, six CI tests are performed, one for every edge in the fully connected graph. Assuming that the result of the fourth CI test is true, we have V1 ⊥⊥ V2, and hence, edge (V1, V2) is removed. In level ℓ = 1, 12 CI tests are performed and edges (V1, V3) and (V2, V3) are removed. Note that by selecting the conditional sets S from G′ but removing edges from G, the algorithm finally reaches the same graph regardless of the edge selection order. In other words, during the execution of the algorithm in a level, S only depends on G′. Since performing CI tests in each level is independent of the edge selection order, making an error in one of the CI tests does not have any impact on the other CI tests in that level.

Fig. 1. An example of execution of PC-stable algorithm. For better readability, we use the term i instead of Vi. For instance, I(0, 1) actually means I(V0, V1). This is done in Fig. 3 and Fig. 4 as well.

3 CUPC: CUDA-ACCELERATED PC ALGORITHM

This section presents our proposed solution for the acceleration of the computationally-intensive portion of PC-stable (lines 5−15 in Algorithm 1) on GPU using the CUDA parallel programming API. We assume that there are no missing observations for any variable, and that the data has a multivariate normal distribution. The overall view of the proposed method is shown in Algorithm 2. The main loop on ℓ which iterates through the levels still exists in the proposed solution, but the internal computations of every level are accelerated on GPU. Specifically, since the computations of level zero can be simplified, a separate parallel algorithm is employed for this level (line 7 in Algorithm 2). For every level ℓ ≥ 1, first G is copied into G′ (line 9), and then the required computations are performed (line 10). Note that we work on the adjacency matrix of graph G, denoted as AG. In order to increase the efficiency of the proposed parallel algorithms, A′G is a compacted version of AG. The details are discussed in the following.

Algorithm 2 Overall view of the proposed solution. Lines 7, 9, and 10 are executed in parallel on GPU.
Input: V
Output: G, SepSet
1: G = fully connected graph
2: SepSet = ∅
3: ℓ = 0
4: AG = adjacency matrix of graph G
5: repeat
6:   if (ℓ == 0) then
7:     GPU: execute level zero
8:   else
9:     GPU: compact AG into A′G
10:    GPU: execute level ℓ
11:  end if
12:  ℓ = ℓ + 1
13: until ( max degree − 1 ≥ ℓ )

A short background on CUDA is presented in Section 3.1. The acceleration of level ℓ = 0 is discussed in Section 3.2. For levels ℓ ≥ 1, two different parallel algorithms called cuPC-E and cuPC-S are proposed. cuPC-E and the compact procedure are discussed in Section 3.3. cuPC-S is discussed in Section 3.4. Further details on some parts of the proposed solution are discussed later in Section 4.

3.1 CUDA

CUDA is a parallel programming API for Nvidia GPUs. A GPU is a massively parallel processor with hundreds to thousands of cores. CUDA follows a hierarchical programming model. At the top level, computationally intensive functions are specified by the programmer as CUDA kernels. A kernel is specified as a sequential function for a single thread. The kernel is then launched for parallel execution on the GPU by specifying the number of concurrent threads.

Threads are grouped into blocks. A kernel consists of a number of blocks, and every block consists of a number of threads. Every block has access to a small, on-chip and low-latency memory, called shared memory. The shared memory of a block is accessible to all threads within that block, but not to any thread from other blocks.¹

In order to identify blocks within a kernel, and also threads within a block, a set of indices are used in the CUDA API, for instance, blockIdx.y and blockIdx.x as the block index in dimension y and dimension x within a 2D kernel, and threadIdx.y and threadIdx.x as the thread index in dimensions y and x within a 2D block. For brevity, we denote these four indices as by, bx, ty and tx, respectively.

¹ This article is presented based on the CUDA programming framework. However, the presented ideas and parallel algorithms can readily be ported to the OpenCL programming framework for other GPU vendors as well. Specifically, block, thread, and shared memory in the CUDA programming framework correspond to work-group, work-item, and local memory in the OpenCL programming framework.

3.2 Level ℓ = 0

The size of the conditional sets S is equal to ℓ (Algorithm 1, line 8). As a result, in level zero, S = ∅, and therefore the required computations can be simplified. Specifically, for every edge (Vi, Vj) in G, only one CI test is required, which is I(Vi, Vj | ∅), or simply I(Vi, Vj). In addition, copying G into G′ is not required.

All the required CI tests I(Vi, Vj) are performed in parallel as shown in Algorithm 3. Since the input graph in level zero is a fully connected undirected graph, a total of n(n − 1)/2 tests are required, i.e., one for every edge. Every test I(Vi, Vj) is assigned to a separate thread and n² threads are launched. Threads are grouped in a 2D kernel of n/32 × n/32 blocks. Every block has 32 × 32 threads. Indices i and j are calculated in lines 1−2. Here, 0 ≤ by, bx < n/32 and 0 ≤ ty, tx < 32. Lines 4−7 are executed in only n(n − 1)/2 threads. In line 4, the CI test I(Vi, Vj) is performed. In lines 5−7, the edge (Vi, Vj) is removed from graph G if Vi ⊥⊥ Vj. The term AG denotes the adjacency matrix of graph G. Edge (Vi, Vj) is removed from G by setting AG[i, j] = AG[j, i] = 0.

Algorithm 3 Acceleration of level ℓ = 0. See Section 3.2.
Input: AG
Output: AG
# of blocks: n/32 × n/32
# of threads / block: 32 × 32
1: i = by × 32 + ty
2: j = bx × 32 + tx
3: if (i < j) then
4:   Perform I(Vi, Vj)
5:   if (Vi ⊥⊥ Vj) then
6:     AG[i, j] = AG[j, i] = 0
7:   end if
8: end if

Algorithm 4 Acceleration of level ℓ ≥ 1 with parallel algorithm cuPC-E. See Section 3.3 and Fig. 3.
Input: AG, A′G, ℓ
Output: AG, SepSet
# of blocks: n × n′/β
# of threads / block: γ × β
1: i = by
2: n′i = size of row i in A′G
3: Copy the entire row i from matrix A′G into vector A′sh in shared memory
4: p = bx × β + tx
5: j = A′sh[p]
6: for (t = ty; t < C(n′i − 1, ℓ); t = t + γ) do
7:   if (AG[i, j] == 1) then
8:     P1×ℓ = Comb(n′i − 1, ℓ, t, p)
9:     S = A′sh[P1×ℓ]
10:    Perform I(Vi, Vj | S)
11:    if (Vi ⊥⊥ Vj | S) then
12:      AG[i, j] = AG[j, i] = 0
13:      Store S in SepSet
14:    end if
15:  end if
16: end for

Fig. 2. A′G is formed by compacting AG. [The figure shows a 7 × 7 adjacency matrix AG and its compacted form A′G, in which row i lists the column indices j of the nonzero entries AG[i, j].]
3.3 Level ℓ ≥ 1: Parallel Algorithm cuPC-E

Two different parallel algorithms (kernels) are proposed for the acceleration of every level ℓ ≥ 1. This section describes the first algorithm, called cuPC-E. See Algorithm 4.

Compact: cuPC-E takes ℓ, AG and A′G as input. As shown in line 9 in Algorithm 2, A′G is formed by compacting the adjacency matrix AG into a sparse representation. Fig. 2 illustrates a small example. An element with value j in the i-th row of A′G denotes the existence of edge (Vi, Vj) in AG. A′G has n rows. Row i has n′i elements, i.e., edges. Let n′ = max_{0 ≤ i < n} n′i.

[Fig. 3: an example of the cuPC-E thread mapping. The recoverable panels show A′sh for block (2, 1), the positions p = 0 ... 5 handled by threads tx = 0, 1, 2 of blocks bx = 0, 1, and, for row i = 2, the conditioning sets processed by thread ty = 0, e.g., t = 0: S = {0, 1}; t = 2: S = {0, 4}; t = 4: S = {1, 3}.]