Parallelization of a Dissipative Particle Dynamics Application in a Partitioned Global Address Space Environment

THESIS

Presented in Partial Fulfillment of the Requirements for the Degree Master of Science in the Graduate School of The Ohio State University

By

Karthik Raj Saanthalingam

Graduate Program in Computer Science and Engineering

The Ohio State University

2012

Dissertation Committee:

Dr. Ponnuswamy Sadayappan, Advisor

Dr. Atanas Nasko Rountev

Copyright by

Karthik Raj Saanthalingam

2012

Abstract

Molecular dynamics simulation provides the methodology for detailed microscopic modeling on the molecular scale. The nature of matter is to be found in the structure and motion of its constituent building blocks, and the dynamics is contained in the solution to the N-body problem. Given that the classical N-body problem lacks a general analytical solution, the only path open is the numerical one. Scientists engaged in studying matter at this level require computational tools to allow them to follow the movement of individual molecules and it is this need that the molecular dynamics approach aims to fulfill.

The Molecular Dynamics method follows a constructive approach, trying to reproduce the microscopic behavior of matter using model systems rather than deducing the behavior directly from experiment. Dissipative particle dynamics (DPD) is a stochastic simulation method for soft materials and has been applied to a variety of simulations. Doubts about its adequacy, due to upper coarse-graining limitations that could prevent the method from being applicable to the whole mesoscopic range, have led to the proposal of a modified coarse-grained level tunable DPD method that demonstrates its performance for linear polymeric systems. The proposed method models the system through the interaction between particles as a result of interparticle forces.

The size of the calculation region is non-trivial; it can vary depending on the simulation strategy adopted and greatly affects the performance of the application.

Simulations of very large systems, approaching a cubic micron for milliseconds, are possible using a parallel implementation of DPD running on multiple processors. Because of the short-range nature of the forces in DPD, an efficient way to parallelize the application is to adopt a spatial decomposition technique. In this scheme, the total simulation space is divided into a number of cuboidal regions which are distributed across the processes in the cluster. Each processor is responsible for integrating the equations of motion of all particles that lie within its region of space. Only particles lying near the boundaries of each processor's space require communication between processors. In order to ensure that the simulation is efficient, the crucial requirement is that the number of particle-particle interactions that require inter-processor communication be much smaller than the number of particle-particle interactions within the bulk of each processor's region of space.

In this thesis we outline various alternatives to increase per-node utilization and reduce the communication bottlenecks in each time step of the computation, so as to achieve a near uniform distribution of work among the processes. A partitioned global address space (PGAS) model is deployed in the parallelization strategy so that the system can be logically partitioned among the processors. The novelty of PGAS is that each portion of the space may have an affinity for a particular process, thereby exploiting locality of reference and reducing the communication overhead.


Dedication

This document is dedicated to my family.


Acknowledgments

Firstly, I would like to express my sincere gratitude to my advisor Dr. P. Sadayappan and Dr. David Hudak for their continuous support of my Masters study and research. They have been a great motivation for me throughout and their guidance has helped with my Masters. Their constant encouragement and complete knowledge of the subject has made me learn and explore more about my research.

My sincere thanks also go to my thesis committee member Dr. Atanas Nasko Rountev, for his encouragement and support.

I would like to thank John Eisenlohr and Mikio Yamanoi for all their motivation, guidance and help throughout my thesis. Also, my thanks to all my other lab members for their support.

Finally, I would like to extend my gratitude to my family and friends for all their help and motivation, without whom this Masters would not have been possible.


Vita

2004 .......................... B.Tech., Information Technology, Anna University, Chennai, India.

2010 to 2012 .................. Masters Student, Department of Computer Science and Engineering, The Ohio State University.

Fields of Study

Major Field: Computer Science and Engineering


Table of Contents

Abstract

Dedication

Acknowledgments

Vita

Fields of Study

Table of Contents

List of Tables

List of Figures

Chapter 1: Introduction
    1.1 Theoretical Foundation
    1.2 Molecular Dynamics
    1.3 Dissipative Particle Dynamics
    1.4 Specific Goals
    1.5 Contributions
    1.6 Organization

Chapter 2: System Design
    2.1 Background
    2.2 Global Arrays
    2.3 Overview
    2.4 Programming Model
    2.5 Overview of the System
    2.6 DPD Parallelization in Global Arrays
        2.6.1 Neighbor List
        2.6.2 Force Calculation
    2.7 DPD Parallelization in OpenMP
        2.7.1 Alternative approach

Chapter 3: Experimental Evaluation
    3.1 Performance Tests
    3.2 Summary

Chapter 4: Conclusions
    4.1 Conclusion

Bibliography


List of Tables

Table 3.1: Sample system attributes

List of Figures

Figure 1.1: Molecular Dynamics Simulation

Figure 1.2: Illustration of Dissipative Particle Dynamics approach

Figure 2.1: GA Programming Model

Figure 2.2: The different approaches to computing interactions [1]

Figure 2.3: Parallel global sum

Figure 3.1: Performance improvement through Serial Optimizations and Vectorization

Figure 3.2: Performance improvement through Serial Optimizations and Vectorization

Figure 3.3: Performance improvement through loop transformation

Figure 3.4: Glenn speedup, neighbor list vectorized

Figure 3.5: Oakley speedup, neighbor list vectorized

Figure 3.6: Runtime comparison between Glenn and Oakley

Figure 3.7: Glenn scaling factor – neighbor list redistributed

Figure 3.8: Oakley scaling factor – neighbor list redistributed

Figure 3.9: Local vs. Redistributed neighbor list runtime comparison

Figure 3.10: OpenMP vs. Global Arrays performance comparison

Chapter 1: Introduction

1.1 Theoretical Foundation

Computer simulations have become an indispensable research tool in modern physics, given the scope of modern applications. The reason is that the theoretical description of nature is expressed in mathematical equations, which in most cases cannot be solved exactly. Hence the majority of non-trivial cases have to be treated through approximations, either analytical or numerical, in order to obtain predictions from the theoretical models. While theories are the basis for comprehending nature, experiments provide the observations that are to be comprehended. It is therefore essential to compare the theoretical predictions against the experimental results in order to verify the validity of the theoretical models. The concrete effects of an approximation are often uncontrollable, so it is difficult to assess whether deviations between theory and experiment are caused by the approximations or not; as a result, the verification of approximate theories against experimental observations is rather unreliable.

Computer simulations serve as a bridge between theory and experiment and, thanks to the ever increasing computational capabilities of modern computers, can help solve complex problems without having to rely on approximations. The exact results obtained through such simulations can be compared against the predictions of approximate theories and thus serve to test the reliability of those theories. On the other hand, simulation results can be compared with experimental measurements and thus serve as a test of the models.

A computer simulation provides the connection between the microscopic details of a model and the macroscopic properties of interest. In places where it is difficult or even impossible to obtain experimental measurements, the computer can function as a virtual 'laboratory', where perfect control of all parameters is possible and accurate 'measurements' can be acquired. The Molecular Dynamics (MD) method solves Newton's equations of motion and obtains the trajectory of the system in discrete time steps.

1.2 Molecular Dynamics

In Molecular Dynamics, the time evolution of a system is simulated by numerically integrating Newton's equations of motion. The integration schemes are derived from a series expansion: the more terms are incorporated, the more exact the algorithm becomes, but at the cost of efficiency.

The most time-consuming part of a computer simulation is the calculation of all the forces in the system [1]. It is desirable to use the algorithm with a large time step in order to keep the number of force calculations for a given simulation time to a minimum. On the other hand, a large time step increases the errors introduced by the discretization of the equations of motion. Moreover, there are algorithms that produce the correct equilibrium distribution only in the limit of an infinitely small time step. In practice, the time step therefore represents a tradeoff between speed and stability of the simulation. The simulation proceeds iteratively by alternately calculating forces and solving the equations of motion based on the accelerations obtained from the new forces [6].
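To make the integration step concrete, the following is a minimal sketch of a single leapfrog-style update of the kind used in particle simulations; the array names (r, v, a), the flat 3*N layout and the function name are illustrative assumptions rather than the simulator's actual data structures.

/* One leapfrog-style integration step for nParticles particles in 3D.
 * Positions r, velocities v and accelerations a are assumed to be flat
 * arrays of length 3*nParticles; deltaT is the time step. Sketch only. */
void LeapfrogStep (double *r, double *v, const double *a,
                   int nParticles, double deltaT)
{
    for (int i = 0; i < 3 * nParticles; i++) {
        v[i] += deltaT * a[i];    /* kick: advance velocity by one step */
        r[i] += deltaT * v[i];    /* drift: advance position with the new velocity */
    }
}

A larger deltaT means fewer calls to the force routine per unit of simulated time but, as noted above, at the price of a larger discretization error.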

Figure 1.1: Molecular Dynamics Simulation

CPU time and memory requirements still restrict the size of the system that can be simulated, e.g. the number of particles that can be handled with the given resources.

Design of a molecular dynamics simulation should account for the available computational power. Simulation size (N, the number of particles), time step and total time duration must be selected so that the calculation can finish within a reasonable time period. However, the simulations should be long enough to be relevant to the time scales of the natural processes being studied. To draw statistically valid conclusions from the simulations, the time span simulated should match the kinetics of the natural process.

During a classical MD simulation, the most computationally intensive task is the evaluation of the forces as a function of the particles' internal coordinates. Within that energy evaluation, the most expensive part is the non-bonded or non-covalent one. Common molecular dynamics simulations are of the order of O(N²) if all pair-wise electrostatic and van der Waals interactions must be accounted for explicitly [1].

The size of the integration step is another predominant factor that impacts total CPU time required by a simulation. This is the time frame between evaluations of the forces. The time-step must be chosen small enough to avoid discretization errors. Also, the simulation box size must be large enough to avoid discrepancies at the boundaries.

1.3 Dissipative Particle Dynamics

Among the complex systems studied with computer simulations are soft matter systems, which often consist of a liquid environment in which the objects of interest are dissolved. These objects may be colloids or polymers in solution, or even biological building blocks. The dynamics of such a system is fundamentally affected by its microscopic structure. As a consequence, a complex fluid is no longer completely described by the Navier-Stokes equation. This is due to the disparate length scales which influence the dynamics, that is, the observed phenomena on the macroscopic scale are caused by the constituents that exist on the microscopic scale. The presence of these disparate length scales poses a problem for computer simulations. Numerical solvers for the Navier-Stokes equation are inadequate because the microscopic structure of the complex fluid cannot be incorporated properly. Molecular Dynamics (MD) simulations on the microscopic level are also inappropriate, as overly long simulation runs are necessary to study macroscopic phenomena. Therefore, new techniques are needed to bridge the gap between the microscopic and the macroscopic level.

Mesoscopic simulation methods try to overcome the problems by using the information about the characteristic length scales of the system to devise new models for complex fluids and efficient computational techniques. They aim at modeling the fluid on some intermediate or mesoscopic length scale that captures the relevant dynamics. Dissipative particle dynamics (DPD) is a particle based mesoscopic simulation method for isothermal complex fluids. DPD is an off-lattice mesoscopic simulation technique which involves a set of particles moving in continuous space and discrete time [2]. Rather than representing single atoms, particles represent whole molecules or fluid regions, and atomistic details are not considered relevant to the processes addressed. The particles' internal degrees of freedom are integrated out and replaced by simplified pairwise dissipative and random forces, so as to conserve momentum locally and ensure correct hydrodynamic behavior. The main advantage of this method is that it gives access to longer time and length scales than are possible using conventional MD simulations.

The updating of the positions and momenta of the particles can be done in a standard MD fashion. In order to achieve a coarse-grained level of description, the interparticle forces contain dissipative and stochastic contributions. The idea of the dissipative particle dynamics method is illustrated in figure 1.2. In dissipative particle dynamics, the particles have continuous positions and interact through pairwise forces that contain a conservative, a dissipative and a random part.

The force acting on a particle i has three parts, each of which is a sum of pair forces:

    F_i = Σ_j (F^C_ij + F^D_ij + F^R_ij)

The three contributions are a conservative force F^C_ij, a dissipative force F^D_ij and a random force F^R_ij [1].
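For completeness, the conventional Groot–Warren forms of these three pair forces are sketched below; the thesis does not reproduce them explicitly, so the weight functions and parameters shown here (maximum repulsion a_ij, friction coefficient γ, noise amplitude σ) are the standard textbook choices and may differ in detail from the ones used in the code.

\begin{aligned}
F^{C}_{ij} &= a_{ij}\left(1 - \frac{r_{ij}}{r_c}\right)\hat{r}_{ij}, \qquad r_{ij} < r_c,\\
F^{D}_{ij} &= -\gamma\, w^{D}(r_{ij})\,(\hat{r}_{ij}\cdot \mathbf{v}_{ij})\,\hat{r}_{ij},\\
F^{R}_{ij} &= \sigma\, w^{R}(r_{ij})\,\theta_{ij}\,\Delta t^{-1/2}\,\hat{r}_{ij},
\end{aligned}

with w^D(r) = [w^R(r)]² and σ² = 2γk_BT (the fluctuation–dissipation condition), where θ_ij is a symmetric random variable with zero mean and unit variance.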

Figure 1.2: Illustration of Dissipative Particle Dynamics approach

1.4 Specific Goals

DPD simulations of bio-molecular systems often run for days to months, yet many events of great scientific interest and pharmaceutical relevance occur on long time scales that remain beyond the reach of today's computational capability. In this thesis we outline several new implementation techniques that significantly accelerate parallel DPD simulations [5].

These include a novel parallel decomposition method and message-passing techniques that reduce communication requirements, as well as novel communication primitives that further reduce communication time. We have also developed numerical techniques and restructured the data structures involved in traditional DPD simulations, maintaining high accuracy while exploiting processor-level vector instructions. These methods are embodied in a newly developed DPD simulator code.

Parallelization of the DPD simulation in a distributed environment involves decomposing the computational space into subsets that are distributed across multiple processors. The total volume of the box is divided into P overlapping subsystems of almost equal volume, and each subsystem is assigned to a single processor in an array of P processors. The processors follow a single program multiple data (SPMD) model and each processor calculates the forces on the particles within its assigned domain.

1.5 Contributions

The thesis focuses on providing a parallel implementation of the DPD simulation in a partitioned global address space (PGAS) model. The Global Arrays (GA) library is adopted for the implementation, since most of the data structures involved are arrays and GA provides a wide range of operations for shared-memory-style programming with multidimensional arrays on distributed-memory computers.

The main goals are as follows:

• Identify computationally intensive modules and perform hotspot analysis

• Restructure and reorder loops and data structures to take advantage of the vectorization capabilities of the processor

• Provide a parallel implementation in GA with the particles and calculation domain distributed across the processes

• Reduce communication overhead through careful partitioning of the calculation space

• Compare the shared memory parallelization in OpenMP against the parallelization in Global Arrays (GA) and suggest a hybrid extension to improve processor utilization and achieve fine-grained parallelism

1.6 Organization

The rest of this document is organized as follows. Chapter 2 presents a detailed analysis of the application and the parallelization strategies. It begins with background on the PGAS model and Global Arrays and the motivation behind our approach. This is followed by the specific goals of the application, the various parallelization techniques adopted for the GA based implementation, and the effect of vectorization and multithreading on the overall performance. It includes an in-depth analysis of the organization of the various modules and discusses the abstract libraries specific to the application, the ways in which inter-process communication is minimized, and ways to achieve a balanced load across the processes. Chapter 3 provides a detailed evaluation of the application through a set of experiments. Finally, Chapter 4 presents the conclusions.


Chapter 2: System Design

2.1 Background

This chapter explains the complete design of the system. It begins with a brief background on the PGAS model and Global Arrays (GA). This is followed by an outline of the application and a bottleneck and hotspot analysis, and then by a fine-grained analysis of the effect of vectorization and multi-threading on the overall performance of the application. Finally, the parallelization of the relevant modules is described in greater detail.

This section gives background on the PGAS model and Global Arrays. It enumerates the advantages of Global Arrays and outlines the usage of various primitive libraries in the implementation.

2.2 Global Arrays

The Global Arrays (GA) toolkit [7] provides a shared memory style programming environment in the context of distributed array data structures. From the user perspective, it provides a global view of arrays that are distributed across the processes, providing a way to access any part of an array as if it were in local memory. The details pertaining to the distribution of data, addressing, and data access are encapsulated in the global array objects. The information about the actual data distribution and locality can be used to exploit data locality.

The primary target architectures for which GA was developed are massively-parallel distributed-memory and scalable shared-memory systems. GA provides a logical partition of the data structures into "local" and "remote" portions [7]. It recognizes the variable data transfer costs required to access data depending on its proximity: access to the local portion of the shared memory is assumed to be faster than access to a remote portion. Irrespective of where the referenced data is located, the library provides uniform access mechanisms for all shared data. In addition, the local portion of the shared data can be accessed directly, in the same way as any other data in the process's local memory. Access to other portions of the shared data must be done through the GA library calls.

GA was designed to augment the message-passing model, and it allows for the use of both shared-memory and message-passing styles of programming in the same program [9]. GA inherits an execution environment from the message-passing library that started the parallel program. GA is implemented as a library with C and Fortran-77 bindings, and Python and C++ interfaces have also been developed. Therefore, explicit library calls are required to use the GA model in a parallel C/Fortran program.

2.3 Overview

The basic shared memory operations supported include get, put, scatter and gather. They are complemented by atomic read-and-increment, accumulate (a reduction operation that combines data in local memory with data in a shared memory location), and lock operations. However, these operations can only be used to access data in global arrays rather than arbitrary memory locations. At least one global array must be created before any data transfer operations are used. The data transfer operations supported by GA are truly one-sided/unilateral [8] and will complete regardless of actions taken by the remote process that owns the referenced data. In particular, GA does not require inserting additional GA library calls to assure communication progress on the remote side, thus providing a uniform access mechanism for local and remote regions through the GA library calls.
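The sketch below exercises these primitives on a one-dimensional global array of doubles; it is a minimal, self-contained illustration of the calls named above (create, put, accumulate, get, sync) using the GA C bindings, and the array name, sizes and values are illustrative assumptions rather than code from the DPD simulator.

#include <mpi.h>
#include "ga.h"
#include "macdecls.h"

int main (int argc, char **argv)
{
    MPI_Init (&argc, &argv);
    GA_Initialize ();
    MA_init (C_DBL, 1000000, 1000000);      /* memory allocator used by GA */

    int dims[1] = {1000}, chunk[1] = {-1};  /* let GA choose the distribution */
    int g_a = NGA_Create (C_DBL, 1, dims, "forces", chunk);
    GA_Zero (g_a);

    int lo[1] = {0}, hi[1] = {9}, ld[1] = {0};
    double buf[10], one = 1.0;
    for (int i = 0; i < 10; i++) buf[i] = (double) i;

    if (GA_Nodeid () == 0)
        NGA_Put (g_a, lo, hi, buf, ld);     /* one-sided store into a patch */
    GA_Sync ();

    NGA_Acc (g_a, lo, hi, buf, ld, &one);   /* atomic accumulate by every process */
    GA_Sync ();

    NGA_Get (g_a, lo, hi, buf, ld);         /* one-sided read of the same patch */

    GA_Destroy (g_a);
    GA_Terminate ();
    MPI_Finalize ();
    return 0;
}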

The distribution of the global arrays can be arbitrarily controlled by the programmer, and GA provides support for both regular and irregular distributions. The GA data transfer operations use a global array index-based interface to refer to the shared data rather than referring to it through addresses. Unlike other global address space models that support remote memory (put/get) operations [8], GA does not require the user to specify the target process where the referenced shared data resides, since it provides a global view of the data structures. The higher level array-oriented API (application programming interface) makes GA easier to use without compromising data locality control. The library internally performs global array index-to-address translation and then transfers data between the appropriate processes. There are provisions to inquire the location of an element or array section and to find the process(es) which own a specified section of the global array. The GA toolkit supports the following data types in the C interface: int, long, float, double and double complex (as a struct). Underneath, the library represents the data using C data types.


2.4 Programming Model

The Global Arrays library supports two programming styles: task-parallel and data-parallel. The GA task-parallel model of computation is based on explicit remote memory copy: the remote portion of shared data has to be copied into the local memory area of a process before it can be used in computations by that process.

Figure 2.1: GA Programming Model

There are provisions to obtain data locality information for the shared data [7]. The library offers a set of operations for management of its data structures, one-sided data transfer operations, and supportive operations for data locality control and queries. The GA shared memory model is a compromise between ease of use and portable performance. Load and store operations are guaranteed to be ordered with respect to each other only if they target overlapping memory locations. The store operations (put, scatter) and accumulate complete locally before returning, i.e., the data in the user's local buffer has been copied out but the operation has not necessarily completed at the remote side.

The application can manage the consistency of its data structures in other cases by using the lock, barrier, and fence operations available in the library. The data-parallel model is supported by a set of collective functions that operate on global arrays or their portions. Inter-processor communication is carried out through the library's remote memory copy or collective message-passing operations.

Global arrays are best suited for array-based asynchronous algorithms [8] where communication patterns are irregular. Due to the one-sided nature of the programming model, data consistency must be explicitly managed.

2.5 Overview of the System

In a DPD simulation, the positions and velocities of particles corresponding to atoms evolve according to the laws of classical physics. This application is a modified coarse-grained level tunable DPD simulation method aimed at demonstrating the validity and performance of DPD for linear polymeric systems [4]. The method can reproduce both static and dynamic properties of entangled linear polymer systems well. Linear and non-linear viscoelastic properties were predicted and, despite being a mesoscale technique, the code is able to capture the transition from the plateau regime to the terminal zone with decreasing angular frequency, the transition from the Rouse to the entangled regime with increasing molecular weight, and the overshoots in both shear stress and normal-stress differences upon start-up of steady shear.

The following outline gives the overall organization of the DPD simulation program, with each module explained in greater detail below.

int main (int argc, char **argv)
{
    GetNameList (argc, argv);   /* read input parameters */
    PrintNameList (stdout);
    SetParams ();
    SetupJob ();
    moreCycles = 1;
    while (moreCycles) {
        SingleStep ();          /* advance the system by one timestep */
        if (stepCount >= stepLimit)
            moreCycles = 0;
    }
}

After the initialization phase (GetNameList, SetParams, SetupJob), in the course of which parameters and other data are input to the program or initialized and storage arrays are allocated, the program enters a loop. Each loop cycle advances the system by a single timestep. The loop terminates when moreCycles is set to zero; here this occurs after a preset number of timesteps, but in a more general context moreCycles can be zeroed once the total processing time exceeds a preset limit. The function that handles the processing for a single timestep, including calls to functions that deal with the force evaluation, integration of the equations of motion, adjustments required by periodic boundaries, and property measurements, is

void SingleStep ()
{
    ++ stepCount;
    timeNow = stepCount * deltaT;
    ApplyBoundaryCond ();       /* periodic boundary adjustments */
    ComputeForces ();           /* force evaluation */
    EvalProps ();               /* property measurements */
    AccumProps (1);             /* accumulate statistics */
    if (stepCount % stepAvg == 0) {
        AccumProps (2);
        PrintSummary (stdout);
        AccumProps (0);
    }
}

All the work needed for initializing the computation is concentrated in the following function.

void SetupJob ()
{
    AllocArrays ();     /* runtime array allocation */
    stepCount = 0;
    InitCoords ();      /* initial coordinates */
    InitVels ();        /* initial velocities */
    InitAccels ();      /* initial accelerations */
    AccumProps (0);     /* initialize property accumulators */
}

The following are some of the programming elements that capture the characteristics of the system:

• parameter input with completeness and consistency checks
• runtime array allocation, with array sizes determined by the actual system size
• initialization of variables
• the main loop, which cycles through the force computations and trajectory integration, and performs data collection at specified intervals
• the processing and statistical analysis of various kinds of measurements

Each time step of the DPD simulation involves computing forces on each particle (force computation) and using these forces to compute updated positions and velocities for each particle by numerically integrating Newton's laws of motion (integration). Most of the computational load lies in the force computation. The force computation uses a model called a molecular mechanics force field (or simply, force field), which specifies the potential energy of the system as a function of the atomic coordinates. The force on a particle is the derivative of this potential energy with respect to the position of that particle [9].


The force exerted on a particle is the sum of the interaction forces between that particle and all particles within a cut-off distance. The all-pairs method, the simplest to implement, is extremely inefficient when the interaction range rc is small compared with the linear size of the simulation region. All pairs of particles must be examined because, owing to the continual rearrangement of particles that characterizes the fluid state, it is not known in advance which particles actually interact. Although testing whether particles are separated by less than rc is only a part of the overall interaction computation, the fact that the amount of computation needed grows as O(N²) rules out the method for all but the smallest values of N.

The cell subdivision method [1] provides a means of organizing the information about particle positions into a form that avoids most of the unnecessary work and reduces the computational effort to the order of O(N). It models the simulation region as a lattice of small cells whose edges all exceed rc in length. Particles are assigned to cells on the basis of their current positions, so that interactions are only possible between particles that are either in the same cell or in immediately adjacent cells; if neither of these conditions is met, then the particles must be at least rc apart and the interaction can be eliminated. Because of symmetry only half the neighboring cells need be considered; thus a total of 14 cells (including the cell itself) must be examined in three dimensions [1], and five in two dimensions. The region size must be at least 4rc for the method to be useful.


Figure 2.2: The different approaches to computing interactions – all pairs, cell subdivision, and neighbor lists [1]

The program for the cell-based force calculation organizes particle information in a linked list. Rather than accessing data sequentially, the linked list associates a pointer pn with each data item xn, the purpose of which is to provide a non-sequential path through the data. In the cell algorithm, linked lists are used to associate particles with the cells in which they reside at any given instant; a separate list is required for each cell.

The reason for using linked lists is that it is not known in advance how many particles occupy each cell, since the number can be anywhere between zero and a value determined by the highest possible packing density; the use of sequential tables that list the particles in each cell, while guaranteeing sufficient storage so that any cell can be maximally occupied, is memory inefficient. The linked list approach does not have this problem because of the way the cell occupancy data are organized; the total storage required for all the linked lists is fixed and known in advance.
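A compact realization of this scheme is the head/next representation: one head index per cell and one next index per particle, so the total storage is nCells + nParticles integers regardless of how the occupancy is distributed. The sketch below assumes flat coordinate arrays and coordinates already folded into the box; the names are illustrative, not the simulator's own.

/* Build cell occupancy lists: cellHead[c] is the first particle in cell c,
 * nextInCell[i] is the next particle in the same cell (-1 terminates). */
void BuildCellList (const double *x, const double *y, const double *z,
                    int nParticles, double cellSize, int nx, int ny, int nz,
                    int *cellHead, int *nextInCell)
{
    for (int c = 0; c < nx * ny * nz; c++)
        cellHead[c] = -1;
    for (int i = 0; i < nParticles; i++) {
        int cx = (int) (x[i] / cellSize);
        int cy = (int) (y[i] / cellSize);
        int cz = (int) (z[i] / cellSize);
        int c  = (cz * ny + cy) * nx + cx;   /* linear cell index */
        nextInCell[i] = cellHead[c];         /* push particle i onto cell c */
        cellHead[c]   = i;
    }
}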

The neighbor list takes advantage of the fact that only a small fraction of the particles examined by the cell method – an average of 4π/81 ≈ 0.16 in three dimensions, π/9 ≈ 0.35 in two – lie within interaction range [1]. A list of such pairs can be constructed from those found by the cell method; in order to allow this list to remain useful over several successive time steps, the cut-off radius is relaxed to rn = rc + Δr. It is then possible to benefit from this reduced neighborhood size, because the change in particle position per time step is too small to cause major deviations in the force calculation.

This implies that the list of neighbors remains valid over a number of time steps, even for relatively small Δr. The fact that the list contains particle pairs that lie outside the interaction range ensures that over this sequence of time steps no new interacting pairs can appear that are not already listed. The value of Δr is inversely related to the rate at which the list must be rebuilt [10], and it also determines the number of extra non-interacting pairs that are included in the list. The neighbor list is represented as a table of particle pairs, and the cell method is used to build it, with the cell size now being determined by the distance rn rather than rc [4]. The decision as to when to refresh the neighbor list is based on monitoring the maximum velocity at each time step, which bounds the maximum possible movement of the particles. Refreshing the neighbor list implies complete reconstruction; potentially interacting pairs are recorded in the neighbor list for subsequent processing.
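One common way to implement the refresh decision, sketched below with illustrative names, is to accumulate an upper bound on particle displacement from the maximum velocity at each step and to rebuild once two particles could have closed the shell Δr between them; this follows the displacement-tracking idea in [1] and is not necessarily the exact criterion used in the code.

/* Accumulate the worst-case displacement since the last rebuild and signal
 * a neighbor list refresh once the shell deltaR (= rn - rc) may be consumed.
 * vMax is the largest particle speed observed in the current time step. */
static double dispHi = 0.0;

int NeedNeighborRefresh (double vMax, double deltaT, double deltaR)
{
    dispHi += vMax * deltaT;          /* no particle moved farther than this */
    if (2.0 * dispHi > deltaR) {      /* two particles approaching head-on  */
        dispHi = 0.0;                 /* reset after scheduling a rebuild   */
        return 1;
    }
    return 0;
}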


2.6 DPD Parallelization in Global Arrays

A distributed memory parallel program consists of several processes on a number of processors that communicate via a network. In the DPD simulation each process takes responsibility for updating the positions of particles that fall in a certain region of space, so as to make it easier to scale the system. The region to be simulated (the global cell) is assumed to be a parallelepiped, divided into a regular grid of small parallelepipeds called 'cells'. The cells are distributed across the processes through a block distribution so as to reduce the number of border particles [1][5] that need to be communicated to the neighboring processes.

Each process updates the coordinates of the particles in the one or more cells that it owns. Neighbor list calculation, explicit pairwise computation of non-bonded forces, computation of bonded forces, force interpolation, computation of constraints, and particle migration all require inter-process communication; the communication overhead is kept to a minimum by communicating only the border particles to the processes owning the neighboring cells.

Among the many parallel implementations of DPD codes, two approaches to particle redistribution between processors are employed [11].

• The calculation space is sliced along one coordinate and divided into identical sub-cells.

• The system is partitioned into a mesh of sub-cells in the x, y and z directions.

The particles from cells that are situated on the boundaries of processor domains have to be communicated to the neighboring processors. These boundary cells define the communication overhead. For the first method of box division this overhead scales as N^(2/3) [3], while for the second method it varies as (N/P)^(2/3). Particles that are physical neighbors should also be close to one another in computer memory.

The second method better optimizes the memory (cache) organization for regular boxes and very large numbers of particles. For a particle system confined in a box elongated in one direction, e.g. along the z-axis, the first method should be better than the second. Slicing the box along the z-axis considerably simplifies the routing of messages and enables sending them in a non-blocking way. Each processor sends a message in only one direction, to its closest neighbor. The number of particles from the neighboring domain "cached" on each processor is smaller for the first method. Load balancing is also easier, consisting of shifting the boundaries of the processor domains along one direction, while for the second method the load balancing schemes are very complex and require an irregular mesh.

Moreover, when a long box is divided into sub-cubes in the x-y plane, the number of sub-cubes may turn out to be small, which produces additional overhead.

The system is represented as an array of structs of 'particles', which has to be linearized into individual global arrays such that each particle attribute is represented as a separate global array. This takes advantage of the fact that not all particle attributes are updated in each time step and most of the attributes are read-only [1][5]. Hence, having individual global arrays for the attributes helps reduce the amount of data being communicated with neighboring processes in each time step. The particle attributes that are updated at each time step are distributed across the processes. The read-only data structures are initialized once and are duplicated across the processes. The initial values of the particle attributes are translated to the global space using 'export', a block send operation; each process updates the portion of the global arrays that it owns with the corresponding particle attributes. The bulk of the computational time is spent in the force calculation between the interacting pairs, and hence efficiently parallelizing the calculation over the interacting pairs reduces the computational time to the order of O(N/P), where P is the number of processes.
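A minimal sketch of this linearization and of the 'export' operation, assuming the GA C bindings and a hypothetical attribute buffer: each per-particle attribute becomes its own one-dimensional global array, and the export is realized as an NGA_Put of the index range that the calling process owns, obtained from NGA_Distribution.

#include "ga.h"

/* Create a 1-D global array for one particle attribute and export the
 * locally owned slice from a replicated local buffer. Illustrative only. */
int ExportAttribute (const double *localValues, int nParticles)
{
    int dims[1] = {nParticles}, chunk[1] = {-1};
    int g_attr = NGA_Create (C_DBL, 1, dims, "attribute", chunk);

    int lo[1], hi[1], ld[1] = {0};
    NGA_Distribution (g_attr, GA_Nodeid (), lo, hi);  /* my owned index range */
    if (lo[0] >= 0)                                   /* skip if nothing owned */
        NGA_Put (g_attr, lo, hi, (void *) &localValues[lo[0]], ld);
    GA_Sync ();
    return g_attr;
}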

2.6.1 Neighbor List

Construction of a neighbor list first requires construction of a cell list, which organizes the particles into linked lists corresponding to the cells to which they belong. The use of the cell list prunes the search space in the construction of the neighbor list [1]. The neighbor list can be represented in various ways, one of which is a simple table of atom pairs – the method used here. An alternative method employs a separate list of neighbors for each atom; all lists are stored in a single array with a separate set of indices specifying the range of list entries for each atom. In either case, the cell method is used to build the neighbor list, with the cell size now being determined by the distance rn rather than rc (if the system is too small – relative to rn – for the cell method to work, then the more costly all-pairs approach must be used to build the list).

The neighbor list is constructed as follows: the master process imports the entire particle position array through a block 'import' operation, which moves data from the global array to the corresponding particle attribute of the local particle structure, and computes the cell list, which stores information regarding the particles contained in each cell. The cell list is then broadcast to all the other processes, i.e., all the processes hold identical cell lists from which to compute their local neighbor lists.

2.6.1.1 Vectorization

Many microprocessors provide short-vector SIMD (single instruction, multiple data) extensions to accelerate multimedia tasks [14]. For example, some Intel and AMD processors provide SSE (streaming SIMD extensions). In such cases, these extensions include the ability to simultaneously perform four 32-bit single precision floating point operations [13]. Unfortunately, these operations require that memory accesses are 16-byte aligned, making it difficult or impossible for a compiler to generate efficient SIMD code automatically [15].

The individual processes calculate local neighbor lists for disjoint cells; a pair of particles, pi and pj, are neighbors if F(pi, pj) is less than some given threshold. VTune analysis identified this calculation as a hotspot: it was not being vectorized because the inner loop iterated over a linked list, which inhibits vectorization.

The sequential loop for Neighbor List calculation is as follows:

For each cell c1 in the system
    Choose an adjacent cell c2
    Loop over all particles p1 of c1
        Loop over all possible neighbors p2 in c2
            Compute F(p1,p2)
            if F(p1,p2) < threshold
                add the pair [p1,p2] to the neighbor list

VTune hotspot analysis also revealed that the customized 'rounding' functions used in the calculation of the threshold were a bottleneck and contributed significantly to the computation time within the loop. The rounding functions were replaced with standard macro functions, since the rounding error is of little significance at the scale of the system and lower precision arithmetic suffices to keep the error margin within agreeable limits; this led to a significant improvement in the serial per-loop computation time.

The inner loop in the neighbor list calculation was transformed so that the particles of a cell are gathered into a fixed-size buffer and the loop iterates over the elements of the buffer instead of over the linked list. The instructions are reordered and the periodic boundary conditions are moved outside the loop without affecting the stability of the system.

Loop transformations for vectorization:

Fix a block_size
For each cell c1 in the system
    Choose an adjacent cell c2
    Loop over all particles p1
        Loop over blocks of particles b2
            Loop 1,block_size (over the particles p2 in block b2)
                transfer the x, y and z coordinates of p2 into 3 arrays
            /** The following loop can now be vectorized **/
            Loop 1,block_size
                compute F(p1,p2) using the values in the arrays
                store the squared distances in an array
            Loop over the squared distance array
                if F[j] < threshold
                    add the pair [p1,p2_j] to the neighbor list
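A C rendering of this transformation is sketched below, reusing the head/next cell-list arrays from the earlier sketch and flat coordinate arrays; the buffer gather turns the distance computation into a unit-stride loop that the compiler can auto-vectorize. The names, pair-buffer layout and same-cell handling are illustrative assumptions.

#define BLOCK_SIZE 256     /* tuned to the L2 cache in the experiments */

/* Gather up to BLOCK_SIZE candidate neighbors of particle p1 from cell c2
 * into contiguous buffers, compute squared distances in a vectorizable
 * loop, then filter against the squared list radius rn2. Returns the
 * number of pairs appended to pairBuf (two ints per pair). */
int CollectNeighbors (int p1, int c2,
                      const double *x, const double *y, const double *z,
                      const int *cellHead, const int *nextInCell,
                      double rn2, int *pairBuf)
{
    double bx[BLOCK_SIZE], by[BLOCK_SIZE], bz[BLOCK_SIZE], d2[BLOCK_SIZE];
    int id[BLOCK_SIZE], nPairs = 0;
    int p2 = cellHead[c2];

    while (p2 >= 0) {
        int n = 0;
        for (; p2 >= 0 && n < BLOCK_SIZE; p2 = nextInCell[p2], n++) {
            bx[n] = x[p2];  by[n] = y[p2];  bz[n] = z[p2];  id[n] = p2;
        }
        for (int j = 0; j < n; j++) {             /* vectorizable loop */
            double dx = x[p1] - bx[j], dy = y[p1] - by[j], dz = z[p1] - bz[j];
            d2[j] = dx * dx + dy * dy + dz * dz;
        }
        for (int j = 0; j < n; j++)               /* scalar filter */
            if (d2[j] < rn2 && id[j] != p1) {     /* self-pair skipped; half-list
                                                     handling for c1 == c2 omitted */
                pairBuf[2 * nPairs]     = p1;
                pairBuf[2 * nPairs + 1] = id[j];
                nPairs++;
            }
    }
    return nPairs;
}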

The loop transformation and restructuring provided a 10% improvement in the computation time of the neighbor list construction and better cache utilization. Once the local neighbor list is generated, each process computes the intra-particle and inter-particle forces for the particles in its region of the neighbor list. At each timestep each process checks whether the displacement of the particles in its local space has deviated beyond the threshold compared with the displacement at the previous time step. If the displacement has exceeded the threshold in any of the processes, that process broadcasts a message to the other processes to update the neighbor list [1], so that all the processes update the neighbor list in the same time step and hence view the same state of the system.

2.6.2 Force Calculation

The intra-particle force calculation involves the forces exerted by segments within a particle on neighboring segments of the same particle. Each process updates the forces for the particles in its local space and exports the values to the global space at the end of the computation.

The inter-particle forces involve the interaction between the neighboring pairs obtained from the neighbor list, and each process computes forces for the particles that belong to the disjoint cells organized in its local neighbor list. Inter-particle forces acting on a particle can be updated by any of the processes, as a particle can be present in the neighbor list of any number of processes [11]. In an ideal scenario each process has to read the latest value of the particle acceleration for the force computation through a one-sided 'lookup' operation and update the new particle acceleration through an 'export' operation, which involves mapping the particle acceleration to its corresponding position in the global array and updating it with the new value. The lookup and export operations would ensure that the processes see the latest values of the particle attributes, in line with the serial implementation, but would be a performance bottleneck.

Inter-particle forces include conservative, dissipative and random forces, which are of the form f_{j+1}(p_i) = f_j(p_i) + a*b_{j+1}, where f_j(p_i) is the force acting on particle p_i at time t_j, and a and b are scalars. The positions and velocities of the atoms at any time step are calculated from the inter-atomic forces. Usually, such forces are additive, i.e., the forces acting on the atoms are the sum of the many-body forces due to the interactions of the atoms with their neighbors. Therefore, the total force on an atom is the reduction of the partial forces arising from the interactions with its neighbors.

The behavior of the system is governed by the choice of random number seed used to set the velocities and accelerations. The random numbers generated are uniformly distributed in (0, 1). A leapfrog parallel pseudo random number generator [16] is used to generate streams of random numbers that are uncorrelated and uniformly distributed across the processes. The processes update the forces and accelerations for the particles in their local neighbor lists in a local buffer, followed by a global reduction at the end to sum the contributions to each particle's force computed across the processes. The processes synchronize and export their region of the particle array to the global space from the reduced buffer, and import the entire acceleration array from the global space, since particles might have crossed cell boundaries during the time step and moved to another cell.
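The global force reduction can be expressed directly with GA's collective element-wise sum; the sketch below assumes a local buffer holding this process's partial accelerations for all particles (three components each), which is an assumed layout rather than the simulator's actual one.

#include "ga.h"

/* Element-wise sum of the partial accelerations across all processes.
 * After the call every process holds the total acceleration of every
 * particle. Illustrative sketch using GA's collective reduction. */
void ReduceAccelerations (double *partialAcc, int nParticles)
{
    GA_Dgop (partialAcc, 3 * nParticles, "+");
}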

At the end of the force computation the processes update the other particle attributes, and attributes which are not read-only are written back to the global space in each time step. The processes then synchronize, proceed to the next time step, import values from the global space to update the local particle attributes, and continue with the neighbor list and force calculation. Attributes which are read-only are written to the global space during the last time step, enabling the master process to gather the attributes to be written to an output file that records the state transitions of the system for the given simulation time frame.

2.7 DPD Parallelization in OpenMP

In OpenMP, a set of particles can be assigned to each thread; the master thread computes and broadcasts the cell list to the other threads. The cell space is partitioned along two of the dimensions in the case of a three-dimensional system, with each thread responsible for calculating the neighbor list for a few cells. Each thread updates a local neighbor list which is private to it, computes its range in the global neighbor list through a prefix sum operation on the local neighbor list lengths across the threads, and copies its local neighbor list into the global neighbor list.

In the case of particle decomposition, several threads may try to write the force acting on the same particle, making it necessary to "protect" such writes. This goal can be achieved using either the REDUCTION clause or the ATOMIC directive, and the "protection" of the write instruction applied to the force array strongly limits the scaling of molecular dynamics codes.

The calculation of the forces consists of a loop running over the particle-neighbor pairs in the neighbor list:

Loop over the local neighbor list
for i = 0, n_neighbors
    idx_particle1 = neighbor_list(i)
    idx_particle2 = neighbor_list(i+1)
    /*** calculation of forces ***/
    call compute_conservative_force(idx_particle1, idx_particle2)
    call compute_dissipative_force(idx_particle1, idx_particle2)
    call compute_random_force(idx_particle1, idx_particle2)
    acceleration(t, idx_particle1) = k*forces(l, idx_particle1) + acceleration(t-1, idx_particle1)
    acceleration(t, idx_particle2) = k*forces(l, idx_particle2) + acceleration(t-1, idx_particle2)

where the neighbor_list array contains the indexes of all the particle-neighbor pairs.

A straightforward parallel version of the force module can be obtained using a PARALLEL FOR directive and a REDUCTION(+: acceleration) clause around the external loop, but OpenMP cannot perform reductions on array or structure type variables. The other alternative is to use the ATOMIC directive, in which a thread acquires exclusive access to the single memory address corresponding to the acceleration of a particle for the duration of the update.

#PRAGMA OMP PARALLEL FOR PRIVATE(idx_particle1, idx_particle2)
for i = 0, n_neighbors
    idx_particle1 = neighbor_list(i)
    idx_particle2 = neighbor_list(i+1)
    /*** calculation of forces ***/
    call compute_conservative_force(idx_particle1, idx_particle2)
    call compute_dissipative_force(idx_particle1, idx_particle2)
    call compute_random_force(idx_particle1, idx_particle2)
    #PRAGMA OMP ATOMIC
    acceleration(t, idx_particle1) = k*forces(l, idx_particle1) + acceleration(t-1, idx_particle1)
    #PRAGMA OMP ATOMIC
    acceleration(t, idx_particle2) = k*forces(l, idx_particle2) + acceleration(t-1, idx_particle2)


The performance of the ATOMIC-based version is poor, as it often happens that a thread updates the acceleration of a particle that also appears in the neighbor list of another thread. Sooner or later, the other thread will try to write the force of either the same particle or of a particle that belongs to the same cache line; because of the memory hierarchy, one of the two writes is then stalled by the other (false sharing).

2.7.1 Alternative approach

An alternative approach is to store partial accelerations of the particles in an array which has an extra dimension corresponding to the threads. During the loop over the particles each thread updates the entries in the column assigned to it of the acceleration array.

The code which implements the first step of this scheme is as follows:

#PRAGMA OMP PARALLEL PRIVATE(idx_particle1, idx_particle2, thread_id)
{
    thread_id = OMP_GET_THREAD_NUM()
    #PRAGMA OMP FOR
    for i = 0, n_neighbors
        idx_particle1 = neighbor_list(i)
        idx_particle2 = neighbor_list(i+1)
        /*** calculation of forces ***/
        call compute_conservative_force(idx_particle1, idx_particle2)
        call compute_dissipative_force(idx_particle1, idx_particle2)
        call compute_random_force(idx_particle1, idx_particle2)
        local_acceleration(thread_id, t, idx_particle1) = k*forces(l, idx_particle1) + local_acceleration(thread_id, t-1, idx_particle1)
        local_acceleration(thread_id, t, idx_particle2) = k*forces(l, idx_particle2) + local_acceleration(thread_id, t-1, idx_particle2)
}

At the end of the updating step, the partial results calculated by the threads are summed up, providing the total force on each particle. In the global sum step, the threads access the local_acceleration array by rows and sum the partial values calculated in the previous step. The implementation is as follows:

for i = 0, num_threads
    for j = 0, num_particles
        acceleration(t, j) = acceleration(t, j) + local_acceleration(i, t, j)

The global sum reduction itself can be done in parallel where a set of particles are assigned to a thread and each thread sums all the entries in the rows corresponding to the coordinates of the particles in its set [17].

for i = 0, num_threads
    #PRAGMA OMP PARALLEL FOR
    for j = 0, num_particles
        acceleration(t, j) = acceleration(t, j) + local_acceleration(i, t, j)
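The same reduction can also be written with a single parallel region, in which each thread sums over the thread dimension for its own contiguous chunk of particles and therefore needs no atomics; the C sketch below assumes the per-thread accumulation array is flattened row by row, which is an assumption about the layout rather than the code's actual structure.

#include <omp.h>

/* Sum per-thread partial accelerations into the global acceleration array.
 * local_acc holds num_threads rows of nCoords doubles (one row per thread);
 * each thread handles a disjoint range of j, so no protection is needed. */
void SumPartialAccelerations (double *acc, const double *local_acc,
                              int num_threads, int nCoords)
{
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < nCoords; j++) {
        double sum = 0.0;
        for (int t = 0; t < num_threads; t++)
            sum += local_acc[(long) t * nCoords + j];
        acc[j] += sum;
    }
}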


Figure 2.3: Parallel global sum


Chapter 3: Experimental Evaluation

3.1 Performance Tests

In this section, we focus on the experimental evaluation of the system. We have performed a variety of experiments for the different approaches adopted in the parallelization strategy in order to evaluate the performance of the system under each approach. The experiments were run on an IBM Opteron cluster (Glenn), where every compute node has dual-socket, quad-core 2.4/2.5 GHz Opterons with a minimum of 24 GB RAM, connected together by 10 Gbps or 20 Gbps InfiniBand [18]. An HP Intel Xeon cluster (Oakley) with 12 cores per node and 48 GB of memory per node has also been used [19].

The goals of the experiments are as follows:

• Profile the serial code to identify hotspots for parallelization and to eliminate bottlenecks

• Perform serial optimizations

• Analyze the effect of vectorization and restructure loops that inhibit vectorization

• Evaluate the scalability of the distributed memory parallelization in Global Arrays

• Evaluate the shared memory implementation in OpenMP


Experiments were run on systems with varying numbers of particles and particle densities, and were designed to compare the improvement each method had on the performance of the application.

Attribute                    Value
Num system                   1
Counter limit for system     100
Num particle                 500
Cells.z                      8
Cells.y                      8
Cells.x                      8

Table 3.1: Sample system attributes

Figures 3.1 and 3.2 show the effect of serial optimizations and vectorization as compared against the serial application. Intel's VTune [20] hotspot analysis revealed the instructions on which a significant amount of computational time is spent in each module. Rounding and square root were identified as bottlenecks in the neighbor list computation. Since the simulation is not prone to precision errors, the rounding function was downgraded, which improved performance by 5%.

Many loops inhibited vectorization because they were in a form which the compiler could not vectorize or because they iterated over non-sequential elements in memory. Vectorization provided a 4x speedup on Oakley, which has a vector length of 4. The loops in the neighbor list calculation were restructured to iterate over blocks of particles collected in a fixed-size buffer, the instructions within the loop were rearranged, and conditional statements were moved outside the loop; this provided a 20% improvement in the performance of the neighbor list calculation, which constituted roughly 40% of the total computational time. The block size of the fixed-size buffer can be tuned to the size of the level 2 cache to improve performance. In the experiments a block size of 256 seemed optimal for mid-range systems with lower particle density.

[Bar chart: runtime in seconds with the old vs. new rounding function for the serial, vectorized, and inner-loop-transformed-and-vectorized versions.]

Figure 3.1: Performance improvement through Serial Optimizations and Vectorization

Figures 3.1 and 3.2 correspond to systems of 500 and 5000 particles, with the counter limit for the system, i.e., the number of time steps, set at 1000 and 100 respectively. The results correspond to runs with the processor-specific optimization flag (-xhost) set, which optimizes the code for the processor it is compiled on [21]. Figure 3.3 outlines the performance improvement through loop transformations that make the loops auto-vectorizable. The performance gain is more pronounced for larger systems.


[Bar chart: runtime in seconds with the old vs. new rounding function for the serial, vectorized, and inner-loop-transformed-and-vectorized versions of the larger system.]

Figure 3.2: Performance improvement through Serial Optimizations and Vectorization

[Bar chart: neighbor list construction (BNL) time in seconds for the serial loop vs. the transformed loop.]

Figure 3.3: Performance improvement through loop transformation


The following two figures illustrate the speedup of the application's neighbor list calculation (BNL) and force calculation (IFC). The results correspond to the implementation in which each process computes a local neighbor list for a set of cells, updates the forces and accelerations of the particles in its local neighbor list, and exports the particle attributes at the end of each time step.

As the system progresses to steady state, particles form clusters and are concentrated in certain regions of the space. This means that load imbalance can be introduced by the data distribution strategy adopted. Oakley outperforms Glenn due to longer vector lengths, larger per-node memory and cache sizes, the use of the processor-specific optimization flag, etc.

[Line chart: BNL and IFC relative speedup on Glenn for 1 to 128 processes, vectorized, local neighbor list.]

Figure 3.4: Glenn speedup, neighbor list vectorized


[Line chart: BNL and IFC relative speedup on Oakley for 1 to 96 processes, vectorized, local neighbor list.]

Figure 3.5: Oakley speedup, neighbor list vectorized

[Line chart: runtime in seconds on Glenn and Oakley for 1 to 8 processes.]

Figure 3.6: Runtime comparison between Glenn and Oakley


The disadvantage of a cell based distribution of particles is that particles may not be uniformly distributed across the cells, and hence load imbalance can be induced by the differences in the number of particles in each process's space and the corresponding neighbor list. An alternative approach is to redistribute the neighbor list equally among the processes. In this case the processes compute a local neighbor list for a set of cells and export the local neighbor list to the global space. The global neighbor list is then redistributed across the processes, thus eliminating load imbalance across the processes at later time steps. However, this necessitates that the processes synchronize before they can import a portion of the global neighbor list to proceed with the force calculation. The results indicate that the communication overhead hinders performance compared with the version with no redistribution of neighbor lists.

[Line chart: BNL and IFC relative speedup on Glenn for 1 to 128 processes with the redistributed neighbor list.]

Figure 3.7: Glenn scaling factor – neighbor list redistributed

[Line chart: BNL and IFC relative speedup on Oakley for 1 to 8 processes with the redistributed neighbor list.]

Figure 3.8: Oakley scaling factor – neighbor list redistributed

[Line chart: runtime in seconds with the local vs. redistributed neighbor list for 1 to 96 processes.]

Figure 3.9: Local vs. Redistributed neighbor list runtime comparison


[Line chart: speedup of the OpenMP implementation and the GA implementation on Glenn and Oakley for 1 to 12 threads/processes.]

Figure 3.10: OpenMP vs. Global Arrays performance comparison

Figure 3.10 outlines the performance comparison of the shared memory and distributed memory implementations in OpenMP and Global Arrays respectively. In the OpenMP implementation of the neighbor list calculation, threads collaborate to generate a single neighbor list, whereas in the distributed memory implementation the processes generate local neighbor lists of varying lengths. The threads iterate over equivalent chunks of particle pairs in the neighbor list for the force computation and perform a global reduction of the computed accelerations at the end of the force computation. In the GA implementation the processes either iterate over their local neighbor lists of uneven lengths or redistribute the neighbor lists across the processes, necessitating additional communication calls. The processes then perform a global reduction on the computed acceleration values, which are then updated in the global space. Thus the communication overhead is kept minimal in a shared memory implementation, although such a system does not scale well.

A viable alternative is to exploit the multiple threads within a processor for parallelization at the intra-node level. The hybrid approach is to run a single process per node with k threads that share the work assigned to that process, which may be implemented using OpenMP or POSIX threads (pthreads). Limitations in the implementations of message-passing protocols, which cap the maximum number of processes that can be supported, can be overcome with the hybrid approach in order to utilize very large numbers of processors. This implementation is highly portable across platforms.
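A minimal sketch of the suggested hybrid structure is shown below: one GA process per node runs the pair loop with OpenMP threads, each thread accumulating into its own slab of a per-node buffer, and GA's collective reduction then combines the per-process results. The pair kernel is a placeholder, and all names, parameters and array layouts are assumptions rather than the simulator's actual code.

#include <omp.h>
#include "ga.h"

/* Hybrid GA + OpenMP force step (sketch). threadAcc must hold
 * max_threads * nParticles zeroed doubles (one scalar accumulator per
 * particle for this placeholder kernel); GA communication stays outside
 * the threaded region, only the computation is multithreaded. */
void HybridForceStep (const int *pairList, int nPairs, const double *x,
                      double *partialAcc, double *threadAcc, int nParticles)
{
    #pragma omp parallel
    {
        double *mine = &threadAcc[(long) omp_get_thread_num () * nParticles];

        #pragma omp for schedule(static)
        for (int k = 0; k < nPairs; k++) {
            int p1 = pairList[2 * k], p2 = pairList[2 * k + 1];
            double f = x[p1] - x[p2];    /* placeholder pair kernel */
            mine[p1] += f;               /* equal and opposite contributions */
            mine[p2] -= f;
        }
        /* implicit barrier, then per-process reduction over threads */
        #pragma omp for schedule(static)
        for (int j = 0; j < nParticles; j++)
            for (int t = 0; t < omp_get_num_threads (); t++)
                partialAcc[j] += threadAcc[(long) t * nParticles + j];
    }
    GA_Dgop (partialAcc, nParticles, "+");    /* inter-process reduction */
}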

3.2 Summary

This chapter discussed the performance evaluation of our system. Various experiments were carried out, focused on comparing the parallel versions against the underlying sequential application and on comparing the various implementation strategies. Scalability and other performance metrics were verified by evaluating the parallel implementation across a wide range of systems of varying sizes and characteristics. The scalability analysis indicates that the neighbor list calculation is the communication and computation hotspot because of the underlying data structures involved.


Chapter 4: Conclusions

4.1 Conclusion

In this work, we have demonstrated a parallel implementation of a DPD simulation application in a PGAS environment and have analyzed the performance of various alternative approaches involved in the parallelization process to better understand the scalability of the underlying parallel algorithms.

The distributed memory implementation is compared against a shared memory implementation in OpenMP to understand the impact of thread-level parallelism at the node level in keeping the communication overhead to a minimum. The algorithms associated with both implementations are analyzed to identify a hybrid approach that would combine the best of both approaches and achieve a given scalability on a smaller number of processors by scheduling the work of a given process across multiple threads within the processor. We have also done a detailed analysis of serial optimizations and ways to improve data parallelism.


Bibliography

[1] D. C. Rapaport. The Art of Molecular Dynamics Simulation. Cambridge University Press, 2004.

[2] P. J. Hoogerbrugge and J. M. V. A. Koelman. Simulating Microscopic Hydrodynamic Phenomena with Dissipative Particle Dynamics. Europhys. Lett., 19 (3), pp. 155-160 (1992).

[3] MD Parallel Algorithms. http://www.sandia.gov/~sjplimp/md.html

[4] Patrick B. Warren. Current Opinion in Colloid & Interface Science, Volume 3, Issue 6, December 1998, pp. 620-624.

[5] Krzysztof Boryczko, Witold Dzwinel, David A. Yuen. Parallel Implementation of the Fluid Particle Model for Simulating Complex Fluids in the Mesoscale.

[6] Molecular Dynamics. http://www.rug.nl/fmns-research/moleculardynamics/index

[7] GA toolkit. www.emsl.pnl.gov/docs/global/

[8] P. Sadayappan, Bruce Palmer, Manojkumar Krishnan, Sriram Krishnamoorthy, Abhinav Vishnu, Daniel Chavarría, Patrick Nichols, Jeff Daily. Overview of the Global Arrays Parallel Software Development Toolkit: Introduction to Global Address Space Programming Models.

[9] PGAS Programming Model. http://www.pgas.org/index.php?option=com_content&view=category&layout=blog&id=37&Itemid=56

[10] Mikio Yamanoi, Oliver Pozo and Joao M. Maia. Linear and non-linear dynamics of entangled linear polymer melts by modified tunable coarse-grained level Dissipative Particle Dynamics. The Journal of Chemical Physics 135, 044904 (2011).

[11] Kevin J. Bowers, Edmond Chow, Huafeng Xu, Ron O. Dror, Michael P. Eastwood, Brent A. Gregersen, John L. Klepeis, Istvan Kolossvary, Mark A. Moraes, Federico D. Sacerdoti, John K. Salmon, Yibing Shan, David E. Shaw. Scalable Algorithms for Molecular Dynamics Simulations on Commodity Clusters.

[12] Jari Mononen, Ville Nenonen. Benefits of in Scientific Calculations.

[13] Carlo Kopp. Vector Processing Futures. June 2000.

[14] Vector processor. http://cva.stanford.edu/classes/ee482s/scribed/lect11.pdf

[15] Auto Vectorization. http://gcc.gnu.org/projects/tree-ssa/vectorization.html

[16] Parallel Random Number Generator. http://www.nersc.gov/users/software/programming-libraries/math-libraries/acml/

[17] Simone Meloni, Alessandro Federico and Mario Rosati. Reduction on arrays: comparison of performances among different algorithms. CASPUR, Inter University Consortium for Supercomputing, Via dei Tizii 6/b, I-00185, Rome, Italy. August 31, 2003.

[18] Glenn Specifications. http://www.osc.edu/supercomputing/computing/opt/index.shtml

[19] Oakley Specifications. http://www.osc.edu/supercomputing/computing/oakley.shtml

[20] Intel VTune Amplifier XE. http://software.intel.com/en-us/articles/intel-vtune-amplifier-xe/

[21] Intel C++ Compiler (icc) Options. http://software.intel.com/sites/products/documentation/hpc/compilerpro/en-us/cpp/mac/man/icc.txt