Noname manuscript No.
(will be inserted by the editor)

Parallel simulation of Brownian dynamics on shared memory systems with OpenMP and Unified Parallel C

Carlos Teijeiro · Godehard Sutmann · Guillermo L. Taboada · Juan Touriño

Received: date / Accepted: date

Abstract  The simulation of particle dynamics is an essential method to analyze and predict the behavior of molecules in a given medium. This work presents the design and implementation of a parallel simulation of Brownian dynamics with hydrodynamic interactions for shared memory systems using two approaches: (1) OpenMP directives and (2) the Partitioned Global Address Space (PGAS) paradigm with the Unified Parallel C (UPC) language. The structure of the code is described, and different techniques for work distribution are analyzed in terms of efficiency, in order to select the most suitable strategy for each part of the simulation. Additionally, performance results have been collected from two representative NUMA systems, and they are studied and compared against the original sequential code.

Keywords  Brownian dynamics · parallel simulation · OpenMP · PGAS · UPC · shared memory

C. Teijeiro, G. L. Taboada, J. Touriño
Computer Architecture Group, University of A Coruña, 15071 A Coruña, Spain
E-mail: {cteijeiro,taboada,juan}@udc.es

G. Sutmann
Jülich Supercomputing Centre (JSC), Institute for Advanced Simulation, Forschungszentrum Jülich GmbH, 52425 Jülich, Germany
ICAMS, Ruhr-University Bochum, D-44801 Bochum, Germany
E-mail: [email protected]

1 Introduction

Particle based simulation methods have been continuously used in physics and biology to model the behavior of different elements (e.g., molecules, cells) in a medium (e.g., fluid, gas) under thermodynamical conditions (e.g., temperature, density). These methods represent a simplification of the real-world scenario but often provide enough details to model and predict the state and evolution of a system on a given time and length scale. Among them, Brownian dynamics describes the movement of particles in solution on a diffusive time scale, where the solvent particles are taken into account only by their statistical properties, i.e. the interactions between solute and solvent are modeled as a stochastic process.

The main goal of this work is to provide a clear and efficient parallelization of the Brownian dynamics simulation, using two different approaches: OpenMP and the PGAS paradigm. On the one hand, OpenMP facilitates the parallelization of codes by simply introducing directives (pragmas) to create parallel regions in the code (with private and shared variables) and distribute their associated workload between threads. Here the access to shared variables is concurrent for all threads, thus the programmer must control possible data dependencies and race conditions. On the other hand, PGAS is an emerging paradigm that treats memory as a global address space divided into private and shared areas. The shared area is logically partitioned into chunks with affinity to different threads. Using the UPC language [15] (which extends ANSI C with PGAS support), the programmer can deal directly with workload distribution and data storage by means of different constructs, such as assignments to shared variables, collective functions or parallel loop definitions.
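As a simple illustration of the two programming styles (a generic sketch, not code from the simulation; the array names, sizes and block factor are arbitrary assumptions, and the UPC fragment belongs to a separate source file compiled with a UPC compiler), the same element-wise update could be expressed as follows:

/* OpenMP: the directive splits the iterations of the loop among the
   threads of the parallel region; a and b are ordinary C arrays.      */
void update_omp(int n, double *a, const double *b)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] += b[i];
}

/* UPC: the arrays live in the shared space with a blocked layout, and
   upc_forall runs iteration i on the thread with affinity to &a[i].   */
#include <upc.h>
#define BLOCK 256
shared [BLOCK] double a[BLOCK*THREADS], b[BLOCK*THREADS];

void update_upc(void)
{
    int i;
    upc_forall (i = 0; i < BLOCK*THREADS; i++; &a[i])
        a[i] += b[i];
}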
The rest of the paper presents the design and implementation of the parallel code. First, some related work in this area is presented, and then a mathematical and computational description of the different parts of the sequential code is provided. Next, the most relevant design decisions taken for the parallel implementation of the code, according to the nature of the problem, are discussed. Finally, a performance analysis of the parallel code is accomplished, and the main conclusions are drawn.

2 Related work

There are multiple works on simulations based on Brownian dynamics, e.g. several simulation tools such as BrownDye [7] and the BROWNFLEX program included in the SIMUFLEX suite [5], and many other studies oriented to specific fields, such as DNA research [8] or copolymers [10]. Although most of these works focus on sequential codes, some of them have developed parallel implementations with OpenMP, applied to smoothed particle hydrodynamics (SPH) [6] and the transport of particles in a fluid [16]. The UPC language has also been used in a few works related to particle simulations, such as the implementation of the particle-in-cell algorithm [12]. However, comparative analyses between parallel programming approaches on shared memory with OpenMP and UPC are mainly restricted to computational kernels for different purposes [11,17], without considering large simulation applications.

The most relevant recent work on parallel Brownian dynamics simulation is BD BOX [2], which supports simulations on CPU using codes written with MPI and OpenMP, and also on GPU with CUDA. However, there is still little information published about the actual algorithmic implementation of these simulations, and their performance has not been thoroughly studied, especially under periodic boundary conditions. In order to address this and provide an efficient parallel solution on different NUMA shared memory systems, we have analyzed the Brownian dynamics simulation for different problem sizes, and described how to address the main performance issues of the code using OpenMP and a PGAS-based approach with UPC.

3 Theoretical description of the simulation

The present work focuses on the simulation of Brownian motion of particles including hydrodynamic interactions. The time evolution in configuration space has been stated by Ermak and McCammon [3] (based on the Fokker-Planck and Langevin descriptions), which includes the direct interactions between particles via systematic forces as well as solvent mediated effects via a correlated random process:

$$\mathbf{r}_i(t+\Delta t) = \mathbf{r}_i(t) + \sum_j \frac{\partial D_{ij}(t)}{\partial \mathbf{r}_j}\,\Delta t + \frac{1}{k_B T}\sum_j D_{ij}(t)\,\mathbf{F}_j(t)\,\Delta t + \mathbf{R}_i(t+\Delta t) \quad (1)$$

Here the underlying stochastic process can be discretized in different ways, leading e.g. to the Milstein [9] or Runge-Kutta [14] schemes, but in Eq. 1 the time integration method follows a simple Euler scheme.
In a system that contains N particles, the trajectory $\{\mathbf{r}_i(t);\, t \in [0, t_{\max}]\}$ of particle i is calculated as a succession of small, fixed time step increments $\Delta t$, defined as the sum of the interactions on the particles during each time step (that is, the partial derivative of the initial diffusion tensor $D_{ij}^0$ with respect to the position component and the product of the forces F and the diffusion tensor at the beginning of a time step, where $k_B$ is Boltzmann's constant and T the absolute temperature), and a random displacement $\mathbf{R}_i(\Delta t)$ associated with the possible correlations between displacements for each particle. The random displacement vector R is Gaussian distributed with average value $\langle \mathbf{R}_i \rangle = 0$ and covariance $\langle \mathbf{R}_i(t+\Delta t)\,\mathbf{R}_j^T(t+\Delta t) \rangle = 2\,D_{ij}(t)\,\Delta t$. The diffusion tensor D thereby contains information about hydrodynamic interactions, which, formally, can be neglected by considering only the diagonal components $D_{ii}$. Choosing appropriate models for the diffusion tensor, the partial derivative of D drops out of Eq. 1, which is true for the Rotne-Prager tensor [13] used in this work. Therefore, Eq. 1 reduces to

$$\Delta \mathbf{r} = \frac{1}{kT}\, D\, \mathbf{F}\, \Delta t + \sqrt{2\Delta t}\; S\, \boldsymbol{\xi} \quad (2)$$

where the expression $\mathbf{R} = \sqrt{2\Delta t}\, S\, \boldsymbol{\xi}$ holds and $D = S S^T$, which relates the stochastic process to the diffusion matrix. Therefore, S may be calculated via a Cholesky decomposition of D or via the square root of D. Both approaches are very CPU time consuming and have a complexity of $\mathcal{O}(N^3)$. A faster approach that approximates the random displacement vector R without constructing S explicitly was introduced by Fixman [4] using an expansion of R in terms of Chebyshev polynomials, which has a complexity of $\mathcal{O}(N^{2.25})$.

The interaction model for the computation of forces in the system (which here considers periodic boundary conditions) consists of (1) direct systematic interactions, which are computed as Lennard-Jones type interactions and evaluated within a fixed cutoff radius, and (2) hydrodynamic interactions, which have long range character. The latter are reflected in the off-diagonal components of the diffusion tensor D and therefore contribute to both terms of the right-hand side of Eq. 2. Due to their long range nature, the diffusion tensor is calculated via the Ewald summation method [1].

4 Design of the parallel simulation

The sequential code considers a unity 3-dimensional box with periodic boundary conditions that contains a set of N equal particles. The main part of this code is a for loop in which each iteration represents a time step. Following the formulas presented in Section 3, each time step of the simulation consists of two main parts: (1) the calculation of the forces that act on each particle (function calc_force), and (2) the computation of random displacements for every particle in the system (function covar).

The main loop of time steps is not parallelizable, as each iteration depends on the previous one. However, the workload of each iteration can be executed in parallel through a domain decomposition of the work. Here OpenMP requires separate parallel regions for each function, as a work distribution is only possible if the parallel region is defined inside the target function; otherwise, its arguments would be treated mandatorily as private variables. Nevertheless, UPC targets a balanced workload distribution between threads and an efficient data access, exploiting the NUMA architecture through the maximization of data locality (processing the data stored in the thread's affine space). A more detailed description of both OpenMP and UPC parallelizations is given in the next subsections.
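For reference, the per-time-step position update of Eq. 2, which combines the forces from calc_force and the correlated random displacements from covar, can be sketched as follows (an illustrative sketch only; the function and variable names are assumptions, not the original code):

#include <math.h>

/* One Euler step per Eq. (2) for n = 3*N coordinates:
   dr = (1/kT) D F dt + sqrt(2 dt) S xi, with D and S stored row-major,
   D = S S^T, and xi a vector of independent standard Gaussian numbers
   (all assumed to be precomputed elsewhere).                           */
void euler_step(int n, double dt, double kT,
                const double *D, const double *S,
                const double *F, const double *xi, double *r)
{
    for (int i = 0; i < n; i++) {
        double drift = 0.0, random = 0.0;
        for (int j = 0; j < n; j++) {
            drift  += D[i*n + j] * F[j];   /* deterministic part  */
            random += S[i*n + j] * xi[j];  /* correlated noise    */
        }
        r[i] += drift * dt / kT + sqrt(2.0 * dt) * random;
    }
}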
4.1 Calculation of interactions

The main part of the computation in function calc_force consists in updating the diffusion tensor D, which is defined in the code as a symmetric square pNDIM × pNDIM matrix (called D), where pNDIM = 3 × N (that is, one value per dimension and per pair of particles). For each element in matrix D, the Ewald summation is split into (1) a short range part, where elements are calculated according to pairwise distances within a cutoff radius, and (2) a long range part, which is evaluated in Fourier k-space. Both computations are performed individually for each value in matrix D using the current coordinates of each particle in the system, with two separate loops. In the sequential code, these loops operate only on the upper triangular part of D, because of its symmetry.

According to the previous definitions, the independent computation of the values of matrix D defines a straightforward parallelization of calc_force in the simulation. The OpenMP implementation uses an omp for directive to parallelize the computation of short and long range interactions, using a dynamic scheduling of iterations because of the different amount of values calculated per particle/dimension, alongside atomic directives to obtain the average overall speed and pressure values. The UPC code performs an analogous processing, but using a 1D storage for matrix D: its elements are divided into THREADS equal chunks (THREADS being the number of threads in the UPC program), and each thread computes its associated values for each particle for all dimensions.

4.2 Computation of random displacements

The random displacements are calculated for each particle as Gaussian random numbers with prescribed covariances according to the diffusion tensor matrix. The simulation code includes two alternative methods to obtain the covariance terms: (1) a Cholesky decomposition of D and (2) Fixman's approximation method.

The Cholesky decomposition of matrix D obtains a lower triangular matrix L, whose values for each row are multiplied by a random value generated for the corresponding particle and dimension. The final result is the sum of all these products, as stated in Eq. 3, where row i represents particle i div DIM in dimension i mod DIM (with DIM = 3, the number of dimensions), and xic is the array of pNDIM random values:

$$R_i(t+\Delta t) = \sum_{j=1}^{i} L[i][j] \cdot xic[j] \quad (3)$$

The original sequential algorithm calculates the first three values of the lower triangular result matrix (that is, the elements of the first two rows), and then the remaining values are computed row by row in a loop. Here the data dependencies between the values of L computed in each iteration and the ones computed in previous iterations prevent the direct parallelization of the loop.
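To make the dependency explicit, a generic row-by-row version of this computation (an illustrative sketch with assumed names and data layout, not the original source) looks as follows; each row i of L requires all previously completed rows, so the outer loop cannot simply be split among threads:

#include <math.h>

/* Row-by-row Cholesky factorization of D (n x n, row-major) and
   accumulation of the random displacements R as in Eq. (3).
   Row i of L depends on rows 0..i-1, which blocks a naive
   parallelization of the outer loop.                             */
void cholesky_displacements(int n, const double *D, double *L,
                            const double *xic, double *R)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j <= i; j++) {
            double s = D[i*n + j];
            for (int k = 0; k < j; k++)
                s -= L[i*n + k] * L[j*n + k];
            L[i*n + j] = (i == j) ? sqrt(s) : s / L[j*n + j];
        }
        R[i] = 0.0;
        for (int j = 0; j <= i; j++)      /* Eq. (3) */
            R[i] += L[i*n + j] * xic[j];
    }
}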
However, UPC takes advantage of its data manipulation facilities and computes the values of L as a sum (reduction) of different contributions calculated by each thread: after a row of matrix L is computed, its elements are used to obtain partial computations of the random displacements associated with the unprocessed rows, thus maximizing parallelism. This functionality is implemented in UPC using the upc_forall construct to distribute the loop iterations between threads, and a broadcast collective to send partial computations. Listing 1 presents a pseudocode of this algorithm.

for every row 'i' in matrix D
    if the row is assigned to MYTHREAD
        obtain values of row 'i' in L
        calculate random displacement 'i'
    endif
    broadcast values in row 'i' to all threads
    upc_forall rows 'j' located after 'i' in matrix D
        calculate partial values for elements in row 'j'
        calculate partial value for displacement 'j'
    endfor
endfor

Listing 1  UPC pseudocode for Cholesky decomposition

Regarding OpenMP, and given the difficulties for an efficient parallelization of the original algorithm commented on previously, an alternative iterative algorithm is proposed: the first column of the result matrix is calculated by dividing the values of the first row of the source matrix by its diagonal value (parallelizable for loop), then these values are used to compute a partial contribution to the rest of the elements in the matrix (also parallelizable for loop). Finally, these two steps are repeated to obtain the rest of the columns of the result matrix. Listing 2 shows the source code of the iterative algorithm: a static scheduling is used in the first two loops, whereas the last two use a dynamic one because of the different workload associated with their iterations.

L[0][0] = sqrt(D[0][0]);
#pragma omp parallel private(i,j,k)
{
    #pragma omp for schedule(static)
    for (j = 1; j < pNDIM; j++)
        L[j][0] = D[0][j] / L[0][0];
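As a sketch of the full column-by-column scheme just described (an illustrative right-looking formulation with assumed names, loop structure and in-place update of D, which need not match the original Listing 2), the factorization and the accumulation of the random displacements of Eq. (3) can be organized as follows:

#include <math.h>

/* Column-by-column (right-looking) Cholesky factorization of D
   (n x n, row-major), overwriting the trailing part of D with the
   partial contributions and accumulating the random displacements R
   (assumed zero-initialized) column by column as in Eq. (3).         */
void cholesky_iterative_omp(int n, double *D, double *L,
                            const double *xic, double *R)
{
    for (int k = 0; k < n; k++) {
        L[k*n + k] = sqrt(D[k*n + k]);

        #pragma omp parallel
        {
            /* column k of L: uniform work per iteration -> static schedule */
            #pragma omp for schedule(static)
            for (int i = k + 1; i < n; i++)
                L[i*n + k] = D[i*n + k] / L[k*n + k];

            /* partial contribution of column k to R and to the remaining
               elements of D: triangular workload -> dynamic schedule      */
            #pragma omp for schedule(dynamic)
            for (int i = k; i < n; i++) {
                R[i] += L[i*n + k] * xic[k];
                for (int j = k + 1; j <= i; j++)
                    D[i*n + j] -= L[i*n + k] * L[j*n + k];
            }
        }
    }
}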

Figure 1  Performance results with 128 particles (Fixman and Cholesky) [panels "Brownian Dynamics - 128 particles": execution time (s) and speedup versus number of threads (1-24) for OpenMP and UPC on NEH and MGC]

Figure 2 shows the simulation results for a large number of particles (4096) and 5 time steps. Only the results of covar with Fixman's algorithm are shown, because of its higher performance. The results up to 8 threads present similar performance, which is mainly due to the heavy computation of long range interactions compared to the relatively small computation of random displacements. In fact, the larger problem size provides higher speedups than those of Figure 1. However, the algorithmic complexity of calculating the diffusion tensor D is $\mathcal{O}(N^2)$, whereas Fixman's algorithm is $\mathcal{O}(N^{2.25})$; thus, when the problem size increases, the generation of random displacements represents a larger percentage of the total simulation time. As a result of this, and also given the parallelization issues commented on in Section 4.2, the speedup is slightly limited for 16 or more threads, mainly for OpenMP (also due to the use of SMT in NEH). However, considering the distance to the ideal speedup, we can conclude that both systems present reasonably good speedups for this code.
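As a rough, back-of-the-envelope check of this scaling argument (not a measurement reported here), the cost of the random displacements relative to the diffusion tensor computation grows as

$$\frac{T_{\mathrm{Fixman}}(N)}{T_{D}(N)} \propto \frac{N^{2.25}}{N^{2}} = N^{0.25}, \qquad \left(\frac{4096}{128}\right)^{0.25} = 32^{0.25} \approx 2.4,$$

so moving from 128 to 4096 particles increases the relative weight of the less scalable covar part by roughly a factor of 2.4, which is consistent with the speedup limitation observed for higher thread counts.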

Figure 2  Performance results with 4096 particles (Fixman) [panels "Brownian Dynamics - 4096 particles - Fixman": execution time (s) and speedup versus number of threads (1-24) for OpenMP and UPC on NEH and MGC]

6 Conclusions

This work has presented the parallel implementation of a Brownian dynamics simulation code on shared memory using OpenMP directives and a PGAS-based approach with the UPC language. According to both programming paradigms, different workload distributions have been presented and tested for the main functions in the code, trying to obtain good performance as the number of threads increases. The scheduling techniques in for loops, the convenient use of reduction or critical directives, and the control of the stack size in parallel function calls have been the most relevant features to tune the OpenMP code, whereas the efficient work distribution and a wise optimization of data management have been key factors for UPC. The results of the parallel implementation have shown that there is a high dependency of the code on the part where random displacements are generated for each particle, and here Fixman's approximation method has proven to be the best choice. Regarding both programming paradigms, UPC has obtained better results than OpenMP, mainly because of the more efficient access to the data set, as the PGAS memory model presents a more straightforward adaptation to current NUMA multicore-based systems, although at the cost of a slightly higher programming effort. OpenMP has provided a more compact implementation in terms of SLOCs and a more straightforward approach to parallel programming but, in general, the trade-off between programmability and performance has been favorable for the UPC code.

Acknowledgements

This work was funded by the Ministry of Science and Innovation of Spain (Project TIN2010-16735), the Spanish network CAPAP-H3 (Project TIN2010-12011-E), and by the Galician Government (Pre-doctoral Program "María Barbeito" and Program for the Consolidation of Competitive Research Groups, ref. 2010/6).

References

1. Beenakker, C.: Ewald Sum of the Rotne-Prager Tensor. Journal of Chemical Physics 85, 1581–1582 (1986)
2. Długosz, M., Zieliński, P., Trylska, J.: Brownian Dynamics Simulations on CPU and GPU with BD BOX. Journal of Computational Chemistry 32(12), 2734–2744 (2011)
3. Ermak, D.L., McCammon, J.A.: Brownian Dynamics with Hydrodynamic Interactions. Journal of Chemical Physics 69(4), 1352–1360 (1978)
4. Fixman, M.: Construction of Langevin Forces in the Simulation of Hydrodynamic Interaction. Macromolecules 19(4), 1204–1207 (1986)
5. García de la Torre, J., Hernández Cifre, J.G., Ortega, A., Schmidt, R.R., Fernandes, M.X., Pérez Sánchez, H.E., Pamies, R.: SIMUFLEX: Algorithms and Tools for Simulation of the Conformation and Dynamics of Flexible Molecules and Nanoparticles in Dilute Solution. Journal of Chemical Theory and Computation 5(10), 2606–2618 (2009)
6. Hipp, M., Rosenstiel, W.: Parallel Hybrid Particle Simulations Using MPI and OpenMP.
In: 10th International Euro-Par Conference (Euro-Par'04), Pisa (Italy), pp. 189–197 (2004)
7. Huber, G.A., McCammon, J.A.: Browndye: A Software Package for Brownian Dynamics. Computer Physics Communications 181(11), 1896–1905 (2010)
8. Klenin, K., Merlitz, H., Langowski, J.: A Brownian Dynamics Program for the Simulation of Linear and Circular DNA and Other Wormlike Chain Polyelectrolytes. Biophysical Journal 74(2 Pt 1), 780–788 (1998)
9. Kloeden, P.E., Platen, E.: Numerical Solution of Stochastic Differential Equations. Springer, Berlin (1992)
10. Li, B., Zhu, Y.L., Liu, H., Lu, Z.Y.: Brownian Dynamics Simulation Study on the Self-assembly of Incompatible Star-like Block Copolymers in Dilute Solution. Physical Chemistry Chemical Physics 14, 4964–4970 (2012)
11. Mallón, D.A., Taboada, G.L., Teijeiro, C., Touriño, J., Fraguela, B.B., Gómez, A., Doallo, R., Mouriño, J.C.: Performance Evaluation of MPI, UPC and OpenMP on Multicore Architectures. In: 16th European PVM/MPI Users' Group Meeting (EuroPVM/MPI'09), Espoo (Finland), pp. 174–184 (2009)
12. Markidis, S., Lapenta, G.: Development and Performance Analysis of a UPC Particle-in-Cell Code. In: 4th Conference on Partitioned Global Address Space Programming Models (PGAS'10), New York (NY, USA), pp. 10:1–10:9. ACM (2010)
13. Rotne, J., Prager, S.: Variational Treatment of Hydrodynamic Interaction in Polymers. Journal of Chemical Physics 50(11), 4831–4837 (1969)
14. Tocino, A., Vigo-Aguiar, J.: New Itô–Taylor Expansions. Journal of Computational and Applied Mathematics 158(1), 169–185 (2003)
15. Unified Parallel C at George Washington University: http://upc.gwu.edu. Accessed 31 October 2012
16. Wei, W., Al-Khayat, O., Cai, X.: An OpenMP-enabled Parallel Simulator for Particle Transport in Fluid Flows. In: 11th International Conference on Computational Science (ICCS'11), Singapore, pp. 1475–1484 (2011)
17. Zhao, W., Wang, Z.: ScaleUPC: A UPC Compiler for Multi-core Systems. In: 3rd Conference on Partitioned Global Address Space Programming Models (PGAS'09), Ashburn (VA, USA), pp. 11:1–11:8. ACM (2009)