Parallel Implementation of the Treecode Ewald Method
Total Page:16
File Type:pdf, Size:1020Kb
Parallel Implementation of the Treecode Ewald Method Dongqing Liu*, Zhong-Hui Duan*, Robert Krasny†, Jianping Zhu‡§ * Department of Computer Science, The University of Akron, Akron, OH 44325, USA. † Department of Mathematics, University of Michigan, Ann Arbor, MI 48109, USA ‡ Department of Theoretical and Applied Mathematics, University of Akron, Akron, OH 44325, USA § Corresponding author: [email protected] Abstract achieved using a distributed parallel system with 5000 processors. To bridge the gap between the time interval In this paper, we discuss an efficient parallel of biological interest and that accessible by MD implementation of the treecode Ewald method for fast simulations, there has been extensive effort to improve evaluation of long-range Coulomb interactions in a the computational efficiency. One approach is to periodic system for molecular dynamics simulations. increase the time step size [4]. Another approach is to The parallelization is based on an adaptive decrease the time needed for the force evaluations at decomposition scheme using the Morton order of the each time step. This paper is concerned with the second particles. This decomposition scheme takes advantage approach. of the data locality and involves minimum changes to In an MD simulation of a system of N particles, the the original sequential code. The Message Passing computational cost is dominated by the frequent force Interface (MPI) is used for inter-processor evaluations. The calculation of long-range Coulomb communications, making the code portable to a variety interactions is the most time-consuming part of the of parallel computing platforms. We also discuss evaluations. A straightforward implementation of the 2 communication and performance models for our calculation requires O(N ) operations. In recent years, parallel algorithm. The predicted communication time several methods have been developed to reduce the and parallel performance from these models match the cost of computing Coulomb interactions while measured results well. Timing results obtained using a maintaining accuracy of N-body MD simulations, system of water molecules on the IA32 Cluster at the notably the multipole expansion based tree-codes and Ohio Supercomputer Center demonstrate high speedup Ewald summation based fast algorithms [5-20]. Two of and efficiency of the parallel treecode Ewald method. the earliest tree-codes were developed for problems involving gravitational interaction by Appel [7], Barnes and Hut [8]. Both algorithms employed monopole 1. Introduction approximations with a complexity of O(Nlog(N)). Appel used a cluster-cluster evaluation procedure, The rapid advancement in computational science while Barnes and Hut used a particle-cluster procedure. Greengard and Rokhlin's fast multipole method reduces has dramatically changed the way researchers conduct 2 scientific investigations, particularly in the studies of the operation count from O(N ) to O(N) [9]. To obtain molecular interactions in a biomolecular system [1]. higher accuracy, the fast multipole method uses higher However, the current modeling and simulation order multipole approximations and a procedure for capabilities available to researchers are still vastly evaluating a multipole approximation by converting it inadequate for the study of complex biomolecular into a local Taylor series. There is much ongoing effort systems, although there has been tremendous progress to optimize the performance of the fast multipole in computer hardware, software, modeling, and method. The latest version of the fast multipole method algorithm development [1, 2]. For example, the typical uses more sophisticated analytical techniques that time step in a molecular dynamics (MD) simulation is combine the use of exponential expansions and on the order of 10-15 second, but the time interval of multipole expansions [10]. The Ewald summation biological interest is typically on the order of 10-6 - 1 method has been widely used to handle Coulomb second. This translates to a simulation involving 109 - interactions for systems with periodic boundary 1015 time steps, where each step includes the conditions [15, 16]. The method converts a computation of short-range interactions and extremely conditionally convergent series into a sum of a constant time consuming long-range Coulomb interactions. As a and two rapidly convergent series, a real space sum and result, the longest simulation time that has ever been a reciprocal space sum. The relative computational cost of the two series is controlled by an adaptive splitting reported was 3.8´10-5 seconds [1, 3], which was parameter a. The popular particle-mesh Ewald method method splits the above conditionally convergent series [17] (PME) reduces the complexity of Ewald into a sum of a constant, E(0), and two rapidly summation by choosing a large value for a; the real convergent series, a real space sum E(r) and a reciprocal space sum is computed in O(N) and the cost of space sum E(k), (0) (r ) (k ) evaluating the reciprocal space sum is reduced from E = E + E + E 2 (2) O(N ) to O(Nlog N) using a particle-mesh interpolation where procedure and the fast Fourier transform (FFT). PME N E (0 ) = a q 2 was a major success in the field of biomolecular p 1 / 2 å j (3) computing, but parallelizing PME has been a great j =1 ' N N challenge due to the high communication cost of 3D erfc(a | ri - rj + Ln |) E (r ) = 1 q q FFT. With a relative small value of a.? the Ewald 2 å å å i j (4) n i=1 j=1 | ri - rj + Ln | summation based tree-code [18] approximates the real 2 1 æ p 2 | n |2 ö N æ 2pi ö space sum using a Barnes-Hut tree and the multipole E ( k) = 1 expç- ÷ q expç n× r ÷ (5) 2pL å 2 ç 2 2 ÷ å j j | n | L a j=1 è L ø expansion of the kernel erfc(|x|)/|x| and reduces the n¹0 è ø 2 computational cost from O(N ) to O(Nlog(N)). and a is a positive parameter. The complementary error While parallel implementations of hierarchical tree- function in the real space sum and the exponential codes, including the parallel fast multipole method and function in the reciprocal space sum decay rapidly with the parallel Barnes-Hut tree based multipole method, the index n, therefore cutoffs rc and kc can be used to have been extensively studied [21-26], the parallel compute the Ewald sum, i.e. only terms satisfying implementation of the Ewald summation based | ri – rj + Ln| £ rc (6) treecode [18] for fast evaluation of long-range for E(r) and Coulomb interactions in a periodic system has never |n| £ kc (7) been reported. In this paper, we present an efficient for E(k) are retained for the computation. The magnitude parallel implementation of this method. We describe an of the Ewald parameter a controls the relative rates of adaptive decomposition scheme based on the Morton convergence of the real space sum and the reciprocal order of the particles. This decomposition scheme takes space sum. When a is large, E(r) converges rapidly and advantage of the data locality and involves minimum can be evaluated to a given accuracy in O(N) changes to the original sequential code. We then operations using an appropriate cutoff rc; however in discuss Amdahl’s law and illustrate the performance this case E(k) converges slowly and O(N2) operations improvement gained by making a common sequential are required since the cutoff kc must be large enough to part faster. The parallel algorithm has been attain the desired accuracy. The situation is reversed implemented using the Message Passing Interface when a is small, and therefore in either case, the (MPI), making it portable to a variety of parallel classical Ewald method requires O(N2) operations. The platforms [27, 28]. We then report and discuss detailed cost can be reduced to O(N3/2) by optimizing the numerical results based on the force evaluation for a parameters a, rc, kc as a function of N [16]. The hidden water system on the IA32 Cluster at the Ohio constant in front of N3/2 can be further reduced using Supercomputer Center. The next section describes the the linked-cell method to reduce the computational cost Ewald summation based treecode. The parallel of locating the particles that are within the cutoff radius implementation of the method and timing results are of a given particle [6]. given in Section 3 and Section 4, respectively, The Ewald summation based multipole method is a followed by the conclusions in Section 5. tree-code [18]. A typical tree-code has three basic features: (a) the particles are divided into nested 2. Treecode Ewald method clusters, (b) the far-field influence of a cluster is approximated using a multipole expansion, and (c) a The total electrostatic energy of a periodic system of recursive procedure is applied to evaluate the required N particles can be described in terms of a conditionally force or potential. The Ewald summation based convergent series: treecode uses an oct-tree data structure and Cartesian ' N N 1 E = 1 q q multipole expansions to approximate the real space 2 å å å i j (1) n i=1 j =1 | ri - r j + L n | sum in the Ewald summation. With a relatively small value for a and an appropriate cutoff rc, the method where qi is the charge of particle i, L is the size of the reduces the computational complexity of the real space simulation box, ri is the position of particle i, |r| part from O(N2) to O(Nlog(N)). The reciprocal part can denotes the Euclidean norm, n = (n1, n2, n3) are be computed in O(N) operations using a cutoff kc. integers, and the prime indicates that the i=j terms are In the Ewald summation based treecode, the oct-tree omitted when n=0.