
Parallel implementations of random time for chemical network

Chuanbo Liu (a), Jin Wang (a,b,*)

(a) State Key Laboratory of Electroanalytical Chemistry, Changchun Institute of Applied Chemistry, Chinese Academy of Sciences, Jilin, People's Republic of China
(b) Department of Chemistry, Physics and Applied , State University of New York at Stony Brook, Stony Brook, USA

Abstract

In this study, we developed a parallel version of the random time algorithm. We first give a rigorous basis for the random time description of the stochastic time evolution of a chemical reaction network. We then review the random time simulation algorithm and present a parallel implementation of the next reaction random time algorithm. A discussion of computational complexity suggests a factor-of-M reduction in computation time (where M is the connection number of the network) for the random time simulation algorithm compared with other exact stochastic simulation algorithms, such as the Gillespie algorithm. For large-scale networks, such as protein-protein interaction networks, M is on the order of 10^8. We further demonstrate the power of random time simulation with a GPGPU parallel implementation, which achieved roughly 100-fold acceleration compared with CPU implementations. The stochastic simulation method developed here can therefore be of great value for simulating the time evolution of large-scale networks.

Keywords: Random time, Stochastic simulations, Parallel algorithm, GPGPU

arXiv:2103.00405v1 [q-bio.MN] 28 Feb 2021

* To whom correspondence should be addressed. Email: [email protected]

1. Introduction

The time evolution of a chemical reaction network is intrinsically stochastic. Starting with Gillespie [1], many algorithms have been developed for stochastic simulation of the time evolution of chemical reaction networks, such as the First Reaction Method (FRM) and the Direct Method (DM). Improved methods have also been developed to accelerate the simulation process: the Optimized Direct Method (ODM) computes the propensity sum Σ_j a_j incrementally, and the Next Reaction Method (NRM) uses a single random number per simulation step. Others, such as the Sorting Direct Method (SDM), the Logarithmic Direct Method (LDM), the Partial-propensity Stochastic Simulation Algorithm (PSSA), PSSA with composition rejection (PSSA-CR), and the Sorting Partial-propensity Direct Method (SPDM), focus on different aspects of the simulation steps.

The development of hardware, especially General Purpose Graphics Processing Units (GPGPUs), offers the opportunity to develop parallel algorithms that take advantage of a large number of thread blocks (TBs). Various data structures and implementations have also been developed to accelerate stochastic simulation at the fine-grained or coarse-grained level. These implementations achieve about a 10-fold simulation acceleration compared with mature CPU-based simulation toolkits such as StochKit. Here we demonstrate a new GPGPU implementation of the random time algorithm for stochastic simulations. By carefully arranging the data distribution across global and local memories, we accelerate the stochastic simulation in a data-parallel manner by roughly 100-fold. In this paper, we first briefly review the random time simulation method, then discuss its relative computational complexity, and finally demonstrate the method on an oscillatory predator-prey model.

2. A brief introduction to the random time approach

In this part we follow [2], but we aim to give a rigorous, generalized treatment. Consider N > 1 chemical species in a well-mixed system, and M > 1 reactions labeled by k; the stoichiometric number of the j-th chemical species in the k-th reaction is σ_jk. The simplest stochastic model describing the system is a continuous-time Markov chain. The state is represented by the molecule numbers of the chemical species, X(t) = {X_1, X_2, ..., X_N}, and reactions are modeled as possible transitions of the chain. It was shown by Gillespie [6, 1] that, in a well-mixed system, the probability that a specific reaction takes place is governed by the propensity function a_k(X(t), t) when time-inhomogeneous chemical reaction networks are considered. The propensity function is

a_k(X(t), t) = c_k(t) h_k(X(t))    (1)

Preprint submitted to                                                March 2, 2021

where

h_k(X(t)) = Π_j ( X_j(t) choose σ_jk ) = Π_j X_j(t)! / ( σ_jk! (X_j(t) − σ_jk)! )    (2)

which counts the number of possible reactant combinations; the state change vector of the k-th jump is v_k. With the Markov property, the conditional probability can be expressed as

P{k-th reaction fired once in (t, t + ∆t] | F_t} = a_k(X(t), t) ∆t    (3)

where F_t is the σ-algebra representing the information about the system that is available at time t [7]. A consequence of Eq. 3 is that the probability of two reactions occurring at the same time is on the order of (∆t)^2, which means this is very unlikely to happen. This can be shown by simply calculating the joint probability:

P{R_k(t + ∆t) − R_k(t) ≥ 2}
    ∼ P{R_k(t + ∆t) − R_k(t) ≥ 1, R_j(t + ∆t) − R_j(t) ≥ 1}
    = P{R_k(t + ∆t) − R_k(t) ≥ 1} · P{R_j(t + ∆t) − R_j(t) ≥ 1}
    ∼ O((∆t)^2)    (4)

The system evolution can be described equivalently as a counting process by changing the random variables: instead of the molecule numbers of the chemical species, we count the number of times each reaction has fired. If R_k(t) is the number of times the k-th reaction has fired up to time t, then the state at time t that originated from X(0) is given by

X(t) = X(0) + Σ_{k=1}^{M} R_k(t) v_k    (5)

This counting process is a generalized Poisson process G(λ(t), t), in the sense that the arrival rate λ is a function of time t. The total number of arrivals of the generalized Poisson process G can be calculated as

R(t) = ∫_0^t dG(λ(s), s)    (6)

The Poisson process N with parameter λ describes the arrival of events in a time interval (t, s]:

P{N(λ, s) − N(λ, t) = k} = e^{−λ(s−t)} (λ(s−t))^k / k!    (7)

Therefore, the probability of k events happening in the time interval (t, t + ∆t] is

P{R(t + ∆t) − R(t) = k | F_t}
    = P{ ∫_t^{t+∆t} dG(λ(s), s) = k | F_t }
    = P{ G(λ(t), t + ∆t) − G(λ(t), t) = k | F_t }
    = e^{−λ(t)∆t} (λ(t)∆t)^k / k!
    = e^{−∫_t^{t+∆t} λ(s)ds} ( ∫_t^{t+∆t} λ(s)ds )^k / k!
    = P{ Y( ∫_0^{t+∆t} λ(s)ds ) − Y( ∫_0^t λ(s)ds ) = k | F_t }    (8)

where Y is a unit rate Poisson process. The trick here is that λ does not change in the infinitesimal time interval (t, t + ∆t]. It can also be noticed that

lim_{∆t→0} ∫_t^{t+∆t} λ(s)ds = lim_{∆t→0} λ(t)∆t    (9)

With Eq. 7, the result is obvious. Eq. 8 suggests

R(t) = Y( ∫_0^t λ(s)ds )    (10)

Next we can establish the relationship between λ and the propensity function a_k. We can rewrite the left side of Eq. 3 as

P{R_k(t + ∆t) − R_k(t) = 1 | F_t}
    = ( ∫_t^{t+∆t} λ_k(s)ds ) e^{−∫_t^{t+∆t} λ_k(s)ds}
    = ∫_t^{t+∆t} λ_k(s)ds + O((∆t)^2)    (11)

Compared with the right side of Eq. 3, we have

∫_t^{t+∆t} λ_k(s)ds = a_k(X(t), t) ∆t    (12)

which means

λ_k(s) = a_k(X(t), t)    (13)

Therefore, from Eq. 5, Eq. 10 and Eq. 13, the system evolution equation is

X(t) = X(0) + Σ_{k=1}^{M} Y_k( ∫_0^t a_k(X(s), s)ds ) v_k    (14)

So we have represented the chemical reaction network evolution process as an incremental counting process. Every infinitesimal time frame of this stochastic process can be further decomposed into M independent unit rate Poisson processes. Eq. 14 can be rewritten in the same form as Eq. 5 by introducing an "internal time" for each chemical reaction channel. The internal time T_k(t) is defined as

T_k(t) = ∫_0^t a_k(X(s), s)ds    (15)

and Eq. 5 becomes

X(t) = X(0) + Σ_{k=1}^{M} Y_k(T_k(t)) v_k    (16)

This is where the random time notion comes from.

3. Random time simulation algorithm

In order to perform a complete stochastic simulation, two things must first be made clear:

1. how much time passes before one of the stochastic processes, Y_k, fires;
2. which Y_k fires at that later time.

If we view the states of the chemical reaction network as points in a multi-dimensional phase space, the time evolution of this model can be viewed as a random walk in a spatially correlated, non-homogeneous space. The two questions above are then the fundamental questions of space and time.

We first address the firing time problem. If we denote by Q(t, s) the probability that no reaction occurs in the time interval (t, s], then by the Markov property we have

Q(t, s + ∆s) = Q(t, s) Q(s, s + ∆s)    (17)

According to Eq. 3, the probability that no reaction occurs in the time interval (s, s + ∆s] is the product over all independent reaction channels:

Q(s, s + ∆s) = Π_{k=1}^{M} (1 − a_k(X(s), s)∆s)    (18)

Therefore,

Q(t, s + ∆s) = Q(t, s) Π_{k=1}^{M} (1 − a_k(X(s), s)∆s)
             = Q(t, s) ( 1 − Σ_{k=1}^{M} a_k(X(s), s)∆s + O((∆s)^2) )    (19)

Taking the limit ∆s → 0, we arrive at the ordinary differential equation

dQ(t, s)/ds = −Q(t, s) Σ_{k=1}^{M} a_k(X(s), s)    (20)

Integrating over the time interval (t, s], and noting that a_k does not change in this interval, we have

Q(t, s) = exp( −Σ_{k=1}^{M} a_k(X(t), t)(s − t) )    (21)

Eq. 21 tells us that if a random number r is uniformly distributed on [0, 1], then the time interval is

∆t = ln(1/r) / Σ_{k=1}^{M} a_k(X(t), t)    (22)

In the random time representation of the stochastic model of a chemical reaction network, for every infinitesimal time frame the stochastic process can be described by a counting process and further decomposed into unit rate Poisson processes. All these unit rate Poisson processes are independent and remain stationary until some chemical reaction channel fires. For the k-th channel, following the same argument from Eq. 17 to Eq. 20, the internal time interval has the same distribution form as Eq. 22:

∆T_k(t) = ln(1/r_k)    (23)

where r_k is a random number uniformly distributed on [0, 1]. So if a set of random numbers {r_1, r_2, ..., r_M} is given, the corresponding real time for the k-th chemical channel can be calculated from Eq. 15:

∆T_k(t) = ∫_t^{t+∆t_k} a_k(X(s), s)ds    (24)

Since all the decomposed processes are unit rate Poisson processes, an event of the k-th chemical reaction channel is expected to arrive after internal time ∆T_k(t); the real time that passes before the k-th channel fires is ∆t_k. Accordingly, the first fired chemical channel is the one with the smallest real waiting time: the first arrival wins all. The real time interval between two consecutive reactions is

∆t = min_k {∆t_k}    (25)

and the firing chemical reaction channel is, of course, the channel that attains this minimum. We have therefore answered the two main questions for stochastic simulation of chemical reaction networks using the random time representation.

4. Next reaction random time algorithm

According to the discussion in the previous section, the simulation procedure applying the random time algorithm can be listed as in Table 1.

One issue with this simulation algorithm is that it demands too many random numbers: for every time step, it needs M random numbers to determine the time interval and the firing reaction channel. A stochastic simulation usually must run for a long time to collect sufficient data points for statistical analysis, and currently available random number generators can only produce pseudorandom sequences, which are more likely to fail over long runs. The stochastic simulation algorithm developed by Gillespie [6] demands only two random numbers in each time step, and the next reaction algorithm demands only one random number per simulation loop.

However, it is possible to develop a random time algorithm that also needs only one random number per simulation step. Notice that the only random number consumed in one simulation step is the one that triggered the firing of a specific reaction channel; the other random numbers are untouched. This means the random numbers that did not trigger a firing can be reused. In fact, this observation shows that the internal time interval of an unfired reaction channel is unchanged at the end of the time frame. That is to say, if the triggered reaction channel is the j-th channel, we can continue the simulation by refreshing only one internal time:

∆T_j = ln(1/r),
∆T_k (k ≠ j) unchanged    (26)

where r is a random number uniformly distributed on [0, 1]. This random time reuse algorithm is much like the Next Reaction Method of stochastic simulation, so we call it the Next Reaction Random Time Algorithm (NRRTA). The simulation procedure of NRRTA is shown in Table 2.

5. Implementations

The most crucial and time consuming step in the next reaction random time algorithm is the selection of the minimum real time interval ∆t_µ. For memory efficiency and algorithmic simplicity, the algorithm was modified as outlined in Table 3.

The key consideration in CUDA programming is the memory arrangement, since memory access latency is about 20-30 times longer than arithmetic operations. Although memory access latency can be partially hidden by the parallel execution of warps on the SMs, cooperating with the GPU memory caches is still vital, and utilizing registers and shared memory contributes to more efficient latency hiding in a parallel algorithm; for details, refer to (Professional CUDA C Programming, 2014). In the original CUDA programming model, the computing hierarchy is arranged as a two-layer grid-block structure. As a natural setup for parallel simulation of biological networks, we add an intermediate layer, called the "bundle", to hold an independent trajectory. This intermediate layer is needed because of the limit on the number of threads that can reside in a particular block. Also, for best performance, the number of threads per block should be a power of two (2^n) for efficient reduction. Given these considerations, the simulation architecture is arranged as follows:

• 1 trajectory corresponds to 1 bundle
• 1 bundle consists of a batch of blocks
• 1 thread in a block corresponds to the computation related to 1 reaction

The best pseudorandom number generator we found is the Mersenne Twister (MT) algorithm. With proper parameter values, MT can generate sequences with a period as long as 2^19937 − 1 and with extremely good statistical properties. However, when MT is implemented on a GPU architecture, its operation speed is limited by the memory bandwidth. To solve this issue, we implemented a modified version of the GPU MT algorithm. The generation speed was increased, but due to the limited size of shared memory, a maximum of 256 threads were allowed in a single block, and a total of 256 blocks were allowed in a single run. The implementation could possibly be improved further, but that would require a complete rewrite of the algorithm in a custom version, demanding much effort and time. We therefore changed from a pure GPU implementation to a heterogeneous version: the CPU is responsible for generating the random numbers, which are then passed to the global memory of the GPU. This implementation works fine for small trajectory numbers and short simulation times, but when the number of simulated trajectories is large or the simulation time is very long, the global memory would be exhausted. This situation can be solved with different streams of GPU computation: depending on the computation latency of the simulation, several streams can be used to simulate different trajectories while at the same time performing memory copies from the CPU memory heap to the GPU global memory.

The most accepted model file format in current systems biology is SBML (Systems Biology Markup Language). However, SBML lacks a collective graphical view of the model file, which makes editing these files inconvenient for ordinary use. We used a different model file format which gives insight into the whole network and also provides all the information needed for GSSA simulations. A typical predator-prey oscillation system can be expressed as in Table 4.

Since biological networks follow a scale-free (or power-law) topology, the reactant matrix, the stoichiometric matrix and the species update matrix are intrinsically sparse. Memory consumption is a potential issue when handling large networks, which can consist of over 10^2 nodes, in other words, over 10^4 reactions. Sparse matrices are the natural choice for biological network simulation. To achieve O(1) accessibility, we chose the COO (coordinate format) representation for sparse matrix storage and computation.

In a large network, sequentially updating all the reactions after a firing event is not a wise idea. The dependency graph was therefore proposed for efficiently updating the network state; the idea was first published by (Michael A. Gibson and Jehoshua Bruck, 2000, JPC). By following the network connections through the affected species, a dependency graph can be constructed. However, for massively parallel computation the network state is updated concurrently, so no dependency graph is used in the CUDA algorithm.

We find shared memory to be vital for the GPU implementation in various ways. One problem of using shared memory is the memory size limitation. By using a sparse representation and limiting the biological network to a maximum of 2 reactants per reaction, the matrices could be fitted into the shared memory of each block. In most biological networks a reaction has at most 2 reactants, and even a reaction involving 3 reactants can be transformed into an equivalent 2-reactant form. With this setup, loads from global memory can be coalesced, and bank conflicts are avoided to some extent.

The most time consuming step in the GPU GSSA algorithm is the reduction step that finds the id and firing time of the next reaction. Classic reduction algorithms are optimized for reducing 2^n elements, but the number of reactions in a biological network can be arbitrary, so the classic reduction algorithm has to be modified to efficiently handle an arbitrary number of elements. Another complexity is the reduction across different blocks: blocks within a bundle should be reduced together, while blocks belonging to different bundles should not. Therefore a two-level reduction is needed. With these considerations in mind, we developed an optimized version of block reduction, and this kernel can be applied to bundle reduction as well. This part forms the loop-unrolled template function which is the core of the reduction computation.

The performance of the GPU version of the GSSA depends on the capability of the graphics card and also on the hierarchy of computation units of the CUDA kernel. In our current setup, using an Nvidia RTX 1060 with compute capability 6.1 and 6 GB of global memory, about a 400-500 fold speed-up can be achieved with single-precision floating point numbers, as shown in Figure 1. The speed-up can be expected to increase further with a more powerful GPU.

[Figure 1: Speed-up per trajectory versus number of blocks for the parallel next reaction random time stochastic simulation of the predator-prey model. The computation was done using an Nvidia RTX 1060 graphics card with compute capability 6.1 and 6 GB of global memory; 10000 trajectories were run simultaneously, with 100000 steps for each trajectory.]

6. Discussions

The major time consuming steps in Gillespie's stochastic simulation algorithm are the calculation of the time interval between two consecutive reactions and the search for the firing reaction channel. The searching step in the random time simulation algorithm is solved by finding the reaction channel that has the minimal expected real time. However, the time consumed in calculating the time interval is not much reduced by the random time simulation algorithm compared with Gillespie's simulation algorithm. For a time-inhomogeneous chemical reaction system, the time interval for Gillespie's stochastic algorithm can be calculated from Eq. 1 and Eq. 22:

ln(1/r) = Σ_{k=1}^{M} ∫_t^{t+∆t} c_k(s) Π_j X_j(t)! / ( σ_jk! (X_j(t) − σ_jk)! ) ds    (27)

For the random time stochastic simulation algorithm, the time interval can instead be obtained by solving the following group of integral equations:

ln(1/r_k) = ∫_t^{t+∆t_k} c_k(s) Π_j X_j(t)! / ( σ_jk! (X_j(t) − σ_jk)! ) ds,  k = 1, ..., M
∆t = min_k {∆t_k}    (28)

For a large-scale chemical reaction network, the only hope of solving Eq. 27 is numerical integration. For the random time simulation algorithm, by contrast, the time evolution is separated from the summation of the propensity functions, so the integral can be calculated analytically for well behaved time-dependent reaction rates.

Assume the number of molecules in the system is of order N, the number of reaction channels is of order M, the stoichiometric numbers are of order H, the number of reacting species in one single reaction is of order Q, and the number of integration steps is of order I. To find ∆t we apply binary tree searching for the next reaction, whose number of element operations is of order R. The number of element operations for calculating one propensity function is then of order 2NQ + 3Q, and the number of element operations for solving Eq. 27 is of order [MI(2NQ + 3Q) + M − 1]R. For Eq. 28, with a well behaved time-dependent perturbation the integral can be solved analytically, so the calculation reduces to a single element operation, and the number of element operations for solving Eq. 28 is of order [2NQ + 3Q + 2]M.

For a large-scale chemical reaction network with about 10^4 chemical species, M is of order 10^8. For a real biological chemical reaction network, N is typically of order 10^3, the stoichiometric numbers are typically less than 10, and the number of species involved in one biochemical reaction is also less than 10. The number of integration steps is related to the required accuracy of the results, since I = ∆t/δs; here we assume it to be of order 10^4. So the number of element operations for solving Eq. 27 is of order 10^16, while for Eq. 28 the number is reduced to order 10^10.

The achieved acceleration, however, cannot match the factor computed above. The main reasons for not reaching this speed include the memory limitation of the GPU hardware, the limited number of computation cores and the limit on the total number of threads. On the other hand, for Eq. 27 the integration of the propensity functions can be made parallel; in this circumstance, the summation of the propensity contributions from all the reaction channels demands a reduction algorithm, which is followed by a searching process to locate the next reaction. The situation is very different when solving Eq. 28: since no summation is demanded, the minimization and the search are done within one single reduction process. This means that at the most time consuming step, the parallel random time algorithm can be approximately 2 times faster than the parallel Gillespie stochastic simulation algorithm. Hence random time based algorithms are more suitable for parallel computation.
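The two-level, bundle-aware minimum reduction described above can be emulated on the CPU to show the logic. The sketch below is a plain Python stand-in for the loop-unrolled CUDA kernel, and the block size, bundle size and (∆t, reaction id) pairs are illustrative values, not data from the paper: the first level reduces each block, padding with +inf so an arbitrary element count is handled, and the second level reduces block minima only within their own bundle, never across bundles.

```python
import math

def block_min_reduce(values, block_size):
    """First level: each block reduces its chunk of (dt, reaction_id)
    pairs to one minimum; padding with +inf handles element counts
    that are not a power of two."""
    minima = []
    for start in range(0, len(values), block_size):
        chunk = values[start:start + block_size]
        chunk = chunk + [(math.inf, -1)] * (block_size - len(chunk))
        minima.append(min(chunk))
    return minima

def bundle_min_reduce(values, block_size, blocks_per_bundle):
    """Second level: block minima are reduced again, but only within
    their own bundle (1 bundle = 1 trajectory), never across bundles."""
    block_minima = block_min_reduce(values, block_size)
    return [min(block_minima[s:s + blocks_per_bundle])
            for s in range(0, len(block_minima), blocks_per_bundle)]

# two bundles (trajectories), each holding 8 reactions in 2 blocks of 4
dts = [(0.9, 0), (0.5, 1), (2.0, 2), (1.2, 3),      # bundle 0, block 0
       (0.7, 4), (3.0, 5), (1.1, 6), (4.0, 7),      # bundle 0, block 1
       (2.2, 8), (0.3, 9), (5.0, 10), (1.0, 11),    # bundle 1, block 0
       (0.8, 12), (0.2, 13), (9.9, 14), (7.7, 15)]  # bundle 1, block 1
winners = bundle_min_reduce(dts, block_size=4, blocks_per_bundle=2)
# winners[0] is the (dt, id) pair firing in trajectory 0, winners[1] in trajectory 1
```

Keeping the second level segmented by bundle is what prevents two independent trajectories from being mixed by a single grid-wide minimum.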

7. ACKNOWLEDGEMENTS

Chuanbo Liu thanks the support of the Natural Science Foundation of China, No. 32000888. Jin Wang thanks the support of grant nos. NSF-PHY 76066 and NSF-CHE-1808474, the Ministry of Science and Technology of China, No. 2016YFA0203200, Projects of Science and Technology Development, Jilin, China, No. 20180414005GH, and Projects of Instrument and Equipment Development, Chinese Academy of Sciences, No. Y928041001.

References

[1] Gillespie, D.T. (1977) Exact stochastic simulation of coupled chemical reactions. J. Phys. Chem., 81(25), 2340-2361.
[2] Anderson, D.F. and Kurtz, T.G. (2011) Continuous Time Markov Chain Models for Chemical Reaction Networks. In Koeppl, H., Setti, G., di Bernardo, M., Densmore, D. (eds), Design and Analysis of Biomolecular Circuits: Engineering Approaches to Systems and Synthetic Biology. Springer New York, New York, NY, pp. 3-42.
[3] Dittamo, C. and Cangelosi, D. (2009) Optimized Parallel Implementation of Gillespie's First Reaction Method on Graphics Processing Units. In 2009 International Conference on Computer Modeling and Simulation, pp. 156-161. IEEE. https://doi.org/10.1109/ICCMS.2009.42
[4] Komarov, I. and D'Souza, R.M. (2012) Accelerating the Gillespie Exact Stochastic Simulation Algorithm Using Hybrid Parallel Execution on Graphics Processing Units. PLoS ONE, 7(11), e46693. https://doi.org/10.1371/journal.pone.0046693
[5] Tian, H. and Burrage, K. (2005) Parallel implementation of stochastic simulation for large-scale cellular processes. In Proceedings - Eighth International Conference on High-Performance Computing in Asia-Pacific Region, HPC Asia 2005, pp. 621-626.
[6] Gillespie, D.T. (1992) A rigorous derivation of the chemical master equation. Physica A, 188, 404-425.
[7] Kolmogorov, A.N. (1956) Foundations of the Theory of Probability. Chelsea Publishing Co., New York. Translation edited by Nathan Morrison, with an added bibliography by A.T. Bharucha-Reid.

Table 1: Simulation procedure of the random time algorithm

initialize time t = 0 with state X(0) = {X_1(0), X_2(0), ..., X_N(0)}
while (t < t_max) do:
    generate M random numbers r_k ~ U(0, 1), k = 1, ..., M
    set internal time intervals ∆T_k = ln(1/r_k) for all k = 1, ..., M
    compute ∆t_k for all k = 1, ..., M by solving ∫_t^{t+∆t_k} a_k(X(s), s)ds = ∆T_k
    select the reaction having ∆t = min_k{∆t_k}; assume the j-th channel is triggered
    update time t = t + ∆t
    update state X(t + ∆t) = X(t) + v_j
end while
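For time-homogeneous mass-action rates, the integral equation in Table 1 solves in closed form, ∆t_k = ∆T_k / a_k, so the listed procedure can be sketched in a few lines of Python. This is a serial illustration of the steps, not the paper's GPU code, and the two-channel birth-death model at the bottom is a hypothetical example rather than one from the paper.

```python
import math
import random

def random_time_ssa(x0, propensities, stoich, t_max, rng=random.random):
    """Random time algorithm of Table 1 for time-homogeneous mass-action
    rates: draw M unit-exponential internal times dT_k = ln(1/r_k),
    convert each to a real waiting time dt_k = dT_k / a_k, and fire the
    channel with the smallest dt_k."""
    x, t = list(x0), 0.0
    times, states = [0.0], [tuple(x)]
    while t < t_max:
        a = [f(x) for f in propensities]               # propensity per channel
        dts = [math.log(1.0 / rng()) / ak if ak > 0 else math.inf
               for ak in a]                            # real waiting times
        j = min(range(len(dts)), key=dts.__getitem__)  # first arrival wins all
        if dts[j] == math.inf:
            break                                      # no channel can fire
        t += dts[j]
        x = [xi + vi for xi, vi in zip(x, stoich[j])]
        times.append(t)
        states.append(tuple(x))
    return times, states

# hypothetical birth-death example: 0 -> A at rate 5, A -> 0 at rate 0.1*A
ts, ss = random_time_ssa([10], [lambda x: 5.0, lambda x: 0.1 * x[0]],
                         [(1,), (-1,)], 50.0)
```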

Table 2: Simulation procedure of the Next Reaction Random Time Algorithm

initialize time t = 0 with state X(0) = {X_1(0), X_2(0), ..., X_N(0)}
generate M random numbers r_k ~ U(0, 1)
set internal time intervals ∆T_k = ln(1/r_k) for all k = 1, ..., M
while (t < t_max) do:
    compute ∆t_k for all k = 1, ..., M by solving ∫_t^{t+∆t_k} a_k(X(s), s)ds = ∆T_k
    select the reaction having ∆t = min_k{∆t_k}; assume the j-th channel is triggered
    update time t = t + ∆t
    update state X(t + ∆t) = X(t) + v_j
    generate a random number r ~ U(0, 1)
    refresh the internal time intervals: ∆T_j = ln(1/r); ∆T_k (k ≠ j) unchanged
end while

Notes: boundary conditions can be imposed by restricting the firing channels; channels that are forbidden from firing are assigned a frozen internal time.
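For the same time-homogeneous setting, the reuse rule of Table 2 can be sketched as follows. After a step of real duration ∆t, each unfired channel has spent a_k·∆t of internal time (the homogeneous form of Eq. 24), so its remaining internal time interval is decremented, while only the fired channel draws a fresh random number. This is a minimal serial illustration, not the paper's implementation; the birth-death model at the bottom is a hypothetical example.

```python
import math
import random

def nrrta(x0, propensities, stoich, t_max, rng=random.random):
    """Next Reaction Random Time Algorithm (Table 2), homogeneous rates.
    Internal time intervals dT_k are drawn once; after each firing, only
    the triggered channel draws a fresh random number, while the others
    keep their leftover internal time dT_k - a_k * dt."""
    x, t = list(x0), 0.0
    M = len(propensities)
    dT = [math.log(1.0 / rng()) for _ in range(M)]   # one exponential per channel
    trajectory = [(t, tuple(x))]
    while t < t_max:
        a = [f(x) for f in propensities]
        dts = [dT[k] / a[k] if a[k] > 0 else math.inf for k in range(M)]
        j = min(range(M), key=dts.__getitem__)
        dt = dts[j]
        if dt == math.inf:
            break
        t += dt
        x = [xi + vi for xi, vi in zip(x, stoich[j])]
        for k in range(M):                           # spend internal time a_k * dt
            if k != j and a[k] > 0:
                dT[k] -= a[k] * dt
        dT[j] = math.log(1.0 / rng())                # refresh only the fired channel
        trajectory.append((t, tuple(x)))
    return trajectory

# hypothetical birth-death example: 0 -> A at rate 5, A -> 0 at rate 0.1*A
traj = nrrta([10], [lambda x: 5.0, lambda x: 0.1 * x[0]], [(1,), (-1,)], 30.0)
```

Note that a channel whose propensity is currently zero simply does not spend internal time, so its stored ∆T_k stays valid until the channel becomes enabled again.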

Table 3: Parallel next reaction random time algorithm

initialize: compute a_k, set ∆t_k = ln(1/r_k)/a_k, set p_k = a_k
while (t < t_max) do:
    scan for ∆t_µ = min_k{∆t_k}
    update the species of reaction µ, set t = t + ∆t_µ
    recalculate a_k for every reaction; set ∆t_µ = ln(1/r)/a_µ and ∆t_k = (p_k/a_k)(∆t_k − ∆t_µ) for k ≠ µ; update p_k = a_k
end while
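A serial emulation of Table 3 may clarify the rescaling step: when the propensities change from p_k to a_k after a firing, the leftover real waiting time (∆t_k − ∆t_µ) of a surviving channel corresponds to a leftover internal time of p_k(∆t_k − ∆t_µ), whose new real duration is (p_k/a_k)(∆t_k − ∆t_µ). The sketch below hard-codes the Lotka model of Table 4; the handling of a channel whose propensity was zero (a fresh exponential draw once it becomes enabled) is our assumption, since Table 3 does not spell it out, and the min scan stands in for the GPU reduction.

```python
import math
import random

def parallel_nrrta_serial(x0, t_max, rng=random.random):
    """Serial emulation of the parallel algorithm in Table 3 on the
    Lotka model of Table 4 (A -> 2A, A + B -> 2B, B -> 0).  Real waiting
    times dt_k are stored instead of internal times; after channel mu
    fires, each surviving channel's leftover (dt_k - dt_mu) is rescaled
    by p_k / a_k, the ratio of old to new propensity."""
    def propensities(x):
        A, B = x
        return [10.0 * A, 0.01 * A * B, 10.0 * B]
    stoich = [(1, 0), (-1, 1), (0, -1)]
    x, t, steps = list(x0), 0.0, 0
    a = propensities(x)
    dt = [math.log(1.0 / rng()) / ak if ak > 0 else math.inf for ak in a]
    p = a[:]
    while t < t_max:
        mu = min(range(3), key=dt.__getitem__)   # the GPU does this as a reduction
        if dt[mu] == math.inf:
            break
        t += dt[mu]
        x = [xi + vi for xi, vi in zip(x, stoich[mu])]
        a = propensities(x)                      # recalculate a_k for every reaction
        for k in range(3):
            if k == mu:
                continue
            if a[k] <= 0:
                dt[k] = math.inf
            elif dt[k] == math.inf or p[k] <= 0:
                dt[k] = math.log(1.0 / rng()) / a[k]   # newly enabled (our assumption)
            else:
                dt[k] = (p[k] / a[k]) * (dt[k] - dt[mu])
        dt[mu] = math.log(1.0 / rng()) / a[mu] if a[mu] > 0 else math.inf
        p = a[:]                                 # remember propensities for next rescale
        steps += 1
    return x, steps
```

Because dt_µ is the minimum, every leftover dt_k − dt_µ is non-negative, so the rescaled waiting times stay valid.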

Table 4: Model parser.

// simple model of Lotka predator oscillation
# init
A = 1000
B = 1000
# reactions
1 A      -->  2 A  : 10
1 A 1 B  -->  2 B  : 0.01
1 B      -->  NULL : 10
# end
time = 2
steps = 10000
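A minimal parser for this format can be sketched in Python. The grammar below is an assumption reconstructed from the single example in Table 4 (`# init` assignments, reactions written as coefficient-species pairs joined by an arrow, read here as `-->` from the garbled `- ->` in the extracted table, with a trailing `: rate`, and `time`/`steps` after `# end`); the real parser also builds the sparse COO matrices, which is omitted here.

```python
def parse_model(text):
    """Parse the Table 4 model format into (init, reactions, settings).
    Each reaction is (reactants, products, rate), with the two sides as
    species -> coefficient dicts; 'NULL' denotes an empty side."""
    init, reactions, settings = {}, [], {}
    section = None
    for raw in text.splitlines():
        line = raw.split("//")[0].strip()          # drop comments and blanks
        if not line:
            continue
        if line.startswith("#"):
            section = line.lstrip("#").strip()     # 'init', 'reactions', 'end'
            continue
        if section == "init":
            name, value = (s.strip() for s in line.split("="))
            init[name] = int(value)
        elif section == "reactions":
            lhs, rate = line.rsplit(":", 1)
            left, right = lhs.split("-->")
            def side(expr):
                tokens = expr.split()              # alternating coeff, species
                pairs = {}
                for coeff, species in zip(tokens[0::2], tokens[1::2]):
                    if species != "NULL":
                        pairs[species] = pairs.get(species, 0) + int(coeff)
                return pairs
            reactions.append((side(left), side(right), float(rate)))
        elif section == "end":
            key, value = (s.strip() for s in line.split("="))
            settings[key] = float(value)
    return init, reactions, settings
```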
