The Pennsylvania State University The Graduate School College of Engineering

PARALLEL PARTICLE-IN-CELL

PERFORMANCE OPTIMIZATION: A CASE STUDY OF

ELECTROSPRAY SIMULATION

A Thesis in Computer Science and Engineering by Ramachandran Kodanganallur Narayanan

© 2016 Ramachandran Kodanganallur Narayanan

Submitted in Partial Fulfillment of the Requirements for the Degree of

Master of Science

May 2016

The thesis of Ramachandran Kodanganallur Narayanan was reviewed and approved∗ by the following:

Kamesh Madduri
Assistant Professor in the Department of Computer Science and Engineering
Thesis Advisor

Mahmut Taylan Kandemir
Professor in the Department of Computer Science and Engineering
Director of Graduate Studies

John Hannan
Associate Professor in the Department of Computer Science and Engineering
Interim Associate Department Head

∗Signatures are on file in the Graduate School.

Abstract

The particle-in-cell (PIC) numerical technique is frequently used in physics and engineering simulations. In this work, we describe ES-PICbench, a new shared-memory parallel implementation of the PIC technique for electrospray simulations. Electrospray simulations are used in aerospace applications, and the goal of an electrospray simulation is to understand the behavior of an electrospray thruster, or colloid thruster. We discuss performance optimizations for the various steps of a PIC-based electrospray simulation. One of the main steps in this simulation is solving the Poisson partial differential equation, and this step can in turn be converted to solving a system of linear equations. We develop a parallel implementation of the Multigrid method for this step. We demonstrate that ES-PICbench is significantly faster than other parallel PIC electrospray simulation implementations on Intel Xeon multicore platforms. Further, ES-PICbench can serve as a real-world scientific computing benchmark for analyzing parallel system performance.

Table of Contents

List of Figures

List of Tables

List of Symbols

Acknowledgments

Chapter 1  Introduction

Chapter 2  Problem Formulation
    2.1 Problem Parameters
    2.2 Methodology
    2.3 Prior Work

Chapter 3  PIC Implementation
    3.1 Creation of Grid
    3.2 Calculation of Release Rate
    3.3 Preallocation and Storage
    3.4 Releasing Particles
    3.5 Particles leaving the domain
    3.6 Weighting charges to nodes
    3.7 Resort Particles
    3.8 Resetting RHS Vector
    3.9 Moving Particles
    3.10 Parallelization
        3.10.1 Releasing Particles
        3.10.2 Particles leaving the domain
        3.10.3 Update Charge Fractions
        3.10.4 Moving Particles
        3.10.5 Performance
    3.11 Results-Particle Distribution

Chapter 4  Poisson Solve
    4.1 Initial Serial Solver
        4.1.1 Matrix-free approach
        4.1.2 Exploration of Solver Options with PETSc
    4.2 Results

Chapter 5  Multigrid Method
    5.1 Methodology
        5.1.1 Pre/Post Smoother
        5.1.2 Restriction
        5.1.3 Recurse/Direct Solve
        5.1.4 Prolongation
        5.1.5 Matrix-Free Approach
    5.2 Implementation
    5.3 Validation
        5.3.1 Dirichlet conditions
        5.3.2 Mixed Neumann
        5.3.3 Small domain length issue
        5.3.4 Troubleshooting Multigrid
    5.4 Comparison with PETSc
    5.5 Parallelization
        5.5.1 Smoother
        5.5.2 Direct Solve
        5.5.3 Remaining steps
        5.5.4 Results
        5.5.5 SubComponent Timings
            5.5.5.1 Smoother
            5.5.5.2 Restriction and Prolongation
        5.5.6 Smoother Performance
    5.6 Code Results
        5.6.1 Potential Distribution
        5.6.2 Effect of Particles on Potential

Chapter 6  Conclusions and Future Work
    6.1 Future work

Bibliography

List of Figures

2.1  Side View With Boundary Conditions
2.2  View from the Capillary with Boundary Conditions
2.3  View from the Extractor with Boundary Conditions
3.1  Outline of Particle class
3.2  Insertion of new particles into gaps
3.3  Swap particles at end to cover leftover gaps
3.4  Charges in a sample 2D grid. Arrows show the contribution towards a particular node
3.5  Area weighted contribution of charge
3.6  Laplace Particle Routine Scalability
3.7  Particle Distribution at Timestep 50
3.8  Particle Distribution at Timestep 100
3.9  Particle Distribution at Timestep 150
3.10 Particle Distribution at Timestep 250
3.11 Front view of distribution at 250 timesteps
4.1  Form of Matrix A with coefficients in case of Laplace equation
4.2  7-pt stencil with Axes convention
4.3  Sparsity Pattern of matrix A
4.4  Evaluating Ax = b
4.5  Graph indicating scaling for Poisson Run
5.1  Allocation of hierarchy of grids for a 2D face
5.2  Form of the V-Cycle
5.3  Iterations taken for varying Problem Size
5.4  Iterations taken for varying Problem Size
5.5  Red-Black coloring
5.6  Multigrid Dirichlet (9 levels, 3 coarse points) Speedup graph
5.7  Scalability plot of Smoother from the Subcomponent Timings
5.8  Plot of Restriction and Prolongation Scalability
5.9  Smoother Scalability with 7 Multigrid levels
5.10 Potential Distribution at Timestep 50
5.11 Electric Field at Timestep 50
5.12 Potential Contours at Timestep 50
5.13 Potential Contours at Timestep 250

List of Tables

2.1  Domain Lengths and Voltage Values used
3.1  Scalability study for the particle routines
3.2  Speedup values for the Moving Particles routine and for the Total Time
4.1  PETSc-3.4.3 Solver and Preconditioner combinations explored
4.2  PETSc-3.6.2 Solver and Preconditioner combinations explored
4.3  Details of Performance Setup on a Single Node on Ganga Cluster
4.4  Scaling results for the Poisson Run with 500 Timesteps
5.1  Validation of Multigrid Solver with Dirichlet conditions
5.2  Validation of the Multigrid solver with Neumann Boundary Conditions
5.3  Multigrid Dirichlet with 9 levels and 3 points on coarsest level
5.4  Timings for the individual subcomponent steps in the Multigrid Dirichlet case
5.5  Scalability study of Smoother from the Subcomponent Timings
5.6  Tabulation of Restriction and Prolongation Scalability
5.7  Details of the Memory Hierarchy for LionXG
5.8  Cachegrind Performance Results
5.9  Standalone Smoother Scalability Setup with 7 Multigrid levels and 3 coarse grid points

List of Symbols

∇          Gradient operator

Φ          Electric potential

ρq         Density of space charge

ε₀         Free-space permittivity

E          Electric field

NMD        Release rate of particles in the Molecular Dynamics code

∆tMD       Molecular Dynamics code timestep

Np,rel     Release rate of particles in a PIC timestep

tPIC       PIC timestep

Ip,rel     Integer part of Np,rel

Dp,rel     Decimal part of Np,rel

ryz,MD     Magnitude of the position vector of a released particle from the capillary center

vyz,MD     Magnitude of the velocity of a released particle from the capillary center

θ, η       Angles used while randomizing particle attributes

yPIC, zPIC         New YZ positions after rotating the position vector through the angles

vy,PIC, vz,PIC     New YZ velocities after rotating the velocity vector through the angles

q          Charge on a particle

F⃗          Force acting on a particle of charge q

mp         Mass of a particle

v⃗ᵗ         Velocity of a particle at time t

Acknowledgments

This research is supported in part by the US National Science Foundation award #1253881. Firstly, I would like to thank Dr. Kamesh Madduri for his support throughout this research project and his guidance on tackling various issues at different stages of the project. I learnt valuable lessons about prototyping an implementation and the aspects to look out for while aiming for performance. The ability to step back and look at the larger problem as well as dive into the details was particularly instructive. I would also like to thank the Computing Facilities at Penn State ICS for the use of the LionX clusters. Being able to run and use the high performance clusters with minimal downtime was extremely valuable, and the large variety of software available on the clusters was very helpful in debugging and working with my own research codes. Similarly, the computing facilities at the Department of Computer Science and Engineering were always available for running my codes. I would like to thank my labmates and friends for their support at various stages of my stay here at Penn State. The journey was a lot easier thanks to them, and it was good to know I could rely on them at all times. I would especially like to thank my former batchmate Rajesh Gandham for helping me understand the multigrid algorithm and for troubleshooting my implementation at certain important stages. Finally, I would like to thank my parents and relatives for affording me the opportunity to come to Penn State and pursue my graduate studies. Moving away from home to pursue other opportunities has always been difficult, but their patience and support have been invaluable at various stages and have allowed me to work towards what I have wanted.

Chapter 1 | Introduction

Colloid thrusters, or electrostatic thrusters, are used in small spacecraft systems and work on the principle of electric propulsion. Electric propulsion is a method to generate a propulsive force by accelerating charged particles using a high electric field. The charged particles come from the propellant used by the colloid thruster. In order to characterize the efficiency of colloid thrusters, numerical simulations are performed. Electrostatic thrusters are used in scenarios where very low thrust is required [1], typically on the order of micronewtons. They also have a relatively high efficiency and are suited for careful manoeuvring or for steady acceleration over a long period of time. The particle-in-cell (PIC) numerical technique can be used to simulate electrospray, or charged particle emissions, in a colloid thruster [2]. The main approach in such a technique involves tracking the particle positions in continuous space and mapping them to "cells", or discretizations of the domain. This allows us to calculate physical attributes such as the charge density on a stationary mesh. With a larger number of particles per cell, the PIC method can also be used to calculate statistically meaningful quantities based on the velocities of the particles in each cell. We solve the Poisson equation for the potential and use a finite difference based approach (among the choices of finite difference, finite volume and finite element) due to the ease of implementing a finite difference grid. Wang et al. [2] had a finite difference Cartesian grid (rectangular grid) based implementation and used PETSc [3–5] to solve the Poisson equation with a Multigrid solver and MPI-based domain decomposition. The scaling performance of that code was seen to be limited by the scaling of the Multigrid solver. Korkut et al. [6] used

an Adaptive Mesh Refinement (AMR) approach to selectively refine cells in the area of interest, but that work was ultimately solving a flow problem outside our domain of interest. Based on this, we also built an AMR-based code similar to [2] and implemented it using Deal.II [7,8] for our particular domain. However, the main bottlenecks seemed to be the search routine that finds the cell a given particle belongs to, and the creation of library-specific data structures for the finite element method. We felt that the overall structure of the problem was simple enough to warrant a standalone code that does not rely on external libraries. Also, we noticed that the use of libraries often impedes the use of convenient multi-threading options like OpenMP, so a standalone code can be parallelized without facing constraints that a library might impose. We also focused on keeping the data structures simple by using C arrays and simple structs, and on minimizing the number of files, so that the compiler could better optimize the code. The overall guiding principle was to keep the implementation simple and straightforward. Finally, we also gain the advantage of specializing the implementation to the problem in ES-PICbench, and we can avoid many corner cases that a more generalized code would have to handle.

Chapter 2 | Problem Formulation

The aim is to move charged particles (generated from a source) in a domain. The domain consists of a Capillary (represented by a small circle), from where we release charged particles at each time step, and an Extractor (an annular ring), which extracts the generated ions by being set to a lower voltage. Based on the voltages set on the domain boundaries, we calculate the potential distribution by solving Poisson's equation, given in Equation (2.1).

∇²Φ = −ρq / ε₀        (2.1)

where Φ is the potential, ρq is the density of space charge and ε₀ is the free-space permittivity. We use the finite difference method to discretize the domain due to its ease of implementation and because it suits the regular Cartesian grid that we are going to use. As an initial approximation we solve the Laplace equation, obtained by setting ρq = 0 and given in Equation (2.2):

∇²Φ = 0        (2.2)

From the potential distribution, we evaluate the Electric field using Equation (2.3)

E = −∇Φ (2.3)

which we then use to move the particles at each time step.

2.1 Problem Parameters

An outline of the problem along with the boundary conditions is shown in Figure 2.1:

Figure 2.1. Side View With Boundary Conditions.

where LV and HV stand for Low Voltage and High Voltage respectively and represent the potentials to which the relevant boundaries are set. The left side represents the Capillary held at Low Voltage and the right side represents the annular Extractor held at High Voltage. The relevant voltage values and lengths are outlined in Table 2.1:

Parameter                  Value
Domain Length              0.3 mm
Capillary Radius           13 µm
Capillary Voltage (LV)     0 V
Inner Extractor Radius     0.1 mm
Outer Extractor Radius     0.14 mm
Extractor Voltage (HV)     −1350 V

Table 2.1. Domain Lengths and Voltage Values used.

Further, the view from the "back" or the Capillary side and the "front" or the Extractor side are shown in Figure 2.2 and Figure 2.3 respectively.


Figure 2.2. View from the Capillary with Boundary Conditions.


Figure 2.3. View from the Extractor with Boundary Conditions.

2.2 Methodology

The overall steps in modelling the simulation are:

1. Read in parameter file that specifies problem parameters.

2. Create grid and read in relevant Molecular Dynamics (MD) dataset files. These files are used to randomly select a set of particles to release from the capillary, at each timestep. As such these datasets represent the physics that occurs inside the capillary in an abstracted manner as outlined in [9].

3. Calculate the release rate so that the number of particles to release at each time step is known. Refer Section 3.2.

4. Solve the Laplace equation first (Equation (2.2)) and simulate the number of timesteps it takes for a single particle to exit the domain. This is useful for preallocating the global Particles list. The Laplace equation acts as a good first approximation for this step. Refer Section 3.3.

5. At each time step:

   • Release particles as calculated before, taking into account particles which have left the domain. Refer Section 3.4.

   • Swap particles at the end of the domain into leftover gaps, if any.

   • Map particles to cells based on the relevant weighting schemes, to calculate the charge density ρq in Equation (2.1).

   • The matrix A represents the coefficients of the equations obtained after discretizing Equation (2.1), b represents the charge density at each node in that equation, and x represents the solution vector which we need to obtain. We now solve the linear system Ax = b, where the vector b has changed, to obtain the potential. We do this in a Matrix-Free manner, as explained in Chapter 5.

   • Update the Electric Field, using Equation (2.3).

   • Move particles in the updated Electric field using the half-timestep based scheme. Keep track of the indices of particles which leave the domain, so that new particles can be introduced at those locations. Refer Section 3.5.

6. Output data files and results if necessary.

The implementation in ES-PICbench follows the overall schematic shown in Algorithm 1.

Algorithm 1 Overview of the ES-PICbench PIC Schematic

function ESPIC-Main
    PreprocessData()
    for i = 1 → nsteps do                       ▷ nsteps is the number of timesteps
        ReleaseParticles()
        SwapGapsWithEndParticles()
        UpdateChargeFractions()
        Solve(A, x, b)
        CalculateElectricField()
        MoveParticlesInField()
    end for
    PostprocessData()
end function

Our approach in ES-PICbench differs from traditional PIC codes due to the nature of the problem, which has features that distinguish it from other typical use cases. These features are:

1. Constant inflow of particles into the domain at each timestep. Most PIC codes might have to deal with a transient flow of particles that can change based on the elapsed time.

2. The boundary conditions involve well known Dirichlet and Neumann con- ditions. More general PIC codes might have to handle mixed boundary conditions or conditions that are more apt for higher particle numbers and interaction with magnetic fields, whereas in our case the number of particles in the domain is lower and the physical modelling is simpler.

3. The variation of the Particles Per Cell along the generated direction can be non-uniform, compared to a more uniform case in other PIC codes.

2.3 Prior Work

For a colloid thruster, the work by Wang et al. [2] demonstrated that the number density (number of particles per volume) changes by several orders of magnitude from the Capillary to the Extractor. This in turn affects a physical quantity called the Debye length, which is a measure of charge separation. Resolving the Debye length with a Cartesian grid, however, is quite costly, and runtimes often went up to many hours or a day. So one would like to selectively have a finer grid only where it matters and a coarser grid in places where this physical quantity need not be resolved as closely. This necessitates a scale-dependent approach, and Korkut et al. [6] used Adaptive Mesh Refinement (AMR) grids to selectively refine a mesh. In [6], however, the cells were refined locally without affecting the rest of the grid, and the problem was more applicable to a physical domain that is outside the domain of our interest. The main aim was to solve Poisson's equation on an unstructured mesh (like an AMR grid). However, this introduces more constraints. Once we start refining parts of an AMR grid, we need methods to maintain a "balance constraint" [10] (to eliminate inconsistencies in the system of equations in the form of "hanging nodes") and to handle the more complicated grid and neighbor connectivity. In this area, the finite element library Deal.II [7,8] is very useful as it handles the grid management and the solution of the equations by interfacing to a solver like PETSc (among others). Deal.II can generate and selectively refine AMR grids based on any criteria, encourages useful programming practices, and provides support through extensive tutorials and documentation. The Deal.II code took around 1.5 hours for a run involving 8000 timesteps that generates around 28,000 particles, on a single node of the LionXG cluster. For the same problem setup, our approach takes around 10 seconds. It is possible that Deal.II, with its sophisticated data structures and various interfaces to packages such as PETSc and p4est, may be inappropriate from a performance standpoint for this particular Capillary-Extractor problem. Our aim in this work has been to develop a lightweight solver devoid of libraries and using only C constructs, in an attempt to develop a benchmark (in the form of ES-PICbench) for these kinds of problems.

Chapter 3 | PIC Implementation

We develop methods for storing and managing the particles generated in the simulation. We denote a general particle by a simple Plain-Old-Data struct (shown in Figure 3.1) which stores the relevant position, velocity, charge and mass information. We decided to keep the data structures simple and use raw pointers for arrays, as we believed this would reduce indirection and allow better optimization by the compiler.

Figure 3.1. Outline of Particle class

The preallocated Particle array simply holds all released particles in the domain. The ordering within this array has no bearing on the physical location in the domain. This allows us to update particle positions immediately with no dependency between array locations. When solving the Poisson equation however, we need more steps for the complete integration, which are outlined in Section 3.6 and Chapter4.
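Figure 3.1 is reproduced only as a caption here. A minimal sketch of such a plain-old-data record, with field names chosen purely for illustration (they may differ from the actual ES-PICbench code), could look like:

/* Sketch of the plain-old-data particle record described above.
 * Field names are illustrative assumptions, not the exact ones in ES-PICbench. */
typedef struct {
    double x, y, z;    /* position components */
    double vx, vy, vz; /* velocity components */
    double charge;     /* charge carried by the particle */
    double mass;       /* particle mass */
} Particle;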

3.1 Creation of Grid

The multigrid approach is used for the solution of the Poisson equation. We performed numerical experiments with PETSc and found that the Multigrid method is well suited for elliptic equations such as the Poisson equation (refer Section 4.1.2). We create this grid at runtime by passing in the number of grid points at the coarse level and the desired number of multigrid levels. Using this, a hierarchy of grids is created; the details are explained in Section 5.2. These grids are stored as one-dimensional C arrays. A point with coordinates (i, j, k) corresponds to the flat index (N²i + Nj + k), where N is the number of points on one side of the cube. This has implications for strided access in memory, and an investigation of the cache performance is presented in Section 5.5.6.
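As a small illustration of this flattened layout (a sketch; the helper name is ours), the index mapping can be written as:

#include <stddef.h>

/* Map a 3D grid coordinate (i, j, k) to the flat index N*N*i + N*j + k,
 * where N is the number of points on one side of the cube. */
static inline size_t grid_index(size_t N, size_t i, size_t j, size_t k)
{
    return (N * N) * i + N * j + k;
}

With this layout, varying k gives unit-stride access, while varying i strides by N² elements, which is why loop ordering matters for cache behaviour.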

3.2 Calculation of Release Rate

We denote the time step in the PIC code as tPIC, the time step in the Molecular Dynamics data as ∆tMD, the number of particles released per PIC timestep as Np,rel, and the number of particles in the MD dataset as NMD. Each MD dataset corresponds to a particular mass flow rate condition and contains the attributes for a set of particles (NMD) released from the capillary during some time interval (∆tMD); the details of this work are in [9]. We also modify the procedure from [2] to account for the fractional part of the release rate, as outlined in Section 3.4.

For each MD dataset, the corresponding ∆tMD is tabulated in [2]. tPIC is taken as 230 ps (picoseconds, or 10⁻¹² seconds), from the tables in the same work. The number of particles released at each time step in the PIC code must match the particle "flow rate" from the MD dataset. That is,

Np,rel / tPIC = NMD / ∆tMD

Therefore

Np,rel = (NMD × tPIC) / ∆tMD

This will typically be a floating-point number, and only an integer number of particles can be released per tPIC. Hence we store the integer and fractional parts separately.

3.3 Preallocation and Storage

The simulation will output a distribution of particles spatially. Hence the Particles are simply stored in a preallocated Particle array. In order to preallocate, it is important to estimate the number of particles that might be in the domain. We do this by simulating the movement of a single particle (choosing a particle with least positive X-Velocity) in a field obtained by solving the Laplace equation and estimating the number of timesteps taken to leave the domain (say Test).

From Section 3.2, if we release Ip,rel particles at each timestep, then the expected number of particles would be (Ip,rel × Test) at "steady state". In practice, we observed a significantly larger number of particles, around double the calculated value. We attribute this to the test particle not being representative enough: there could be many particles which take longer to leave the domain. To account for these variations, we simply preallocate the particle array to double the calculated value, i.e. 2 Ip,rel Test, which was seen to be sufficient for various runs.

Np,rel = Ip,rel + Dp,rel

Ip,rel = Int(Np,rel)

Dp,rel = Frac(Np,rel)

3.4 Releasing Particles

For releasing the particles, we follow a different approach than [2]. Instead of generating random numbers to decide when to release more particles (to account for Dp,rel), we account for the fractional part by:

• Releasing Ip,rel particles at each time step.

• Tracking the running fractional part with another variable: at each time step, we add Dp,rel to this running fractional part.

• Whenever the running fractional part accumulates an integer part of one or more, we add that integer to the number of particles to release in that time step and remove it from the running fraction. With this, the fractional part is accounted for over time (a sketch of this bookkeeping is shown below).
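A C sketch of this bookkeeping (the variable and function names are ours, not the ones in the actual code) is:

/* Sketch of the fractional release-rate bookkeeping described above.
 * np_rel is the floating-point release rate N_p,rel per PIC timestep.
 * Called once per timestep, from a single thread. */
static double frac_accum = 0.0;             /* running fractional part */

int particles_to_release(double np_rel)
{
    int    ip_rel = (int)np_rel;            /* integer part I_p,rel    */
    double dp_rel = np_rel - ip_rel;        /* fractional part D_p,rel */

    frac_accum += dp_rel;                   /* accumulate the fraction */
    int extra = (int)frac_accum;            /* whole particles earned  */
    frac_accum -= extra;                    /* keep only the remainder */

    return ip_rel + extra;                  /* release count this step */
}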

Although the MD data provides a sample size of particles to release, if we release all the selected particles as such, we would only have particles with initial positions and velocities as given by the dataset. This may not be a good distribution as particles with other velocities must be present too. For this purpose, whenever a particle is released from the MD dataset, we rotate its Y and Z attributes (position and velocity). This ensures a better “spread” of particles with differing velocities. The X components are preserved, but we rotate the Y and Z attributes in the same manner as [2] and the outline of the approach is shown in Algorithm2.

Algorithm 2 Randomize an input particle's Y and Z Attributes

function randomizeParticleAttribs(p)                ▷ Input particle p
    θrnd ← randNum                                  ▷ Between 0 and 2π
    ryz ← √(p.y² + p.z²)                            ▷ Adjusting y and z about the radial center
    p.y ← ryz cos θrnd
    p.z ← ryz sin θrnd
    ▷ Adjust the velocity components
    vyz ← √(p.Vy² + p.Vz²)
    p.Vy ← vyz cos θrnd
    p.Vz ← vyz sin θrnd
    return p
end function
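A C rendering of this rotation (a sketch: it assumes the Particle struct sketched earlier and the per-thread rand_r seeding discussed in Section 3.10.1; names are ours) could be:

#include <math.h>
#include <stdlib.h>

/* Rotate a particle's Y/Z position and velocity through a random angle,
 * preserving the radial magnitudes and the X components. */
void randomize_particle_attribs(Particle *p, unsigned int *seed)
{
    double theta = 2.0 * M_PI * ((double)rand_r(seed) / (double)RAND_MAX);
    double ryz   = sqrt(p->y * p->y + p->z * p->z);
    double vyz   = sqrt(p->vy * p->vy + p->vz * p->vz);

    p->y  = ryz * cos(theta);   /* same random angle is applied to both */
    p->z  = ryz * sin(theta);   /* the position and the velocity vector */
    p->vy = vyz * cos(theta);
    p->vz = vyz * sin(theta);
}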

Once a particle's attributes are randomized, we release the particles into the domain with the approach in Algorithm 3.

Algorithm 3 Release particles into the domain

procedure releaseParticles(np)                      ▷ np is the number to release, from Section 3.2
    for i ∈ np do
        p ← randomizeParticleAttribs(random pick from MD data)
        if gaps are left then
            Insert into gap
            gapCount ← gapCount − 1
        else
            Insert at last index of domain array of particles
            domainCount ← domainCount + 1
        end if
    end for
end procedure

If any gaps are left, we introduce the new particle into existing “gaps” as outlined in Section 3.5.

3.5 Particles leaving the domain

In the course of a simulation, there can be many particles which leave the domain. Whenever particles are moved, we use the final position to check whether the particle has left the domain. If so, we store the index (of the preallocated Particle array) where this happens in another array and this index position is denoted as a “gap”. When we introduce new particles, we simply insert into the gaps as shown in Figure 3.2.

Figure 3.2. Insertion of new particles into gaps

13 If the gaps become full, we insert newly released particles at the end of the array. If any gaps are still left, we simply copy over particles from the end of the array into the gaps (as shown in Figure 3.3) and adjust the counts accordingly. The overall pseudocode for the swapping procedure is shown in Algorithm4.

Algorithm 4 Fill up any remaining gaps after inserting released particles into earlier gaps.

function swapGapsWithEndParticles(domainParticles, count, lostParticlesList)
                                                    ▷ count is the bound for the domainParticles array
    for index in lostParticlesList do
        domainParticles[index] ← domainParticles[end]   ▷ end refers to the end of domainParticles
        count ← count − 1                               ▷ adjust the bound
    end for
    return count
end function

Figure 3.3. Swap particles at end to cover leftover gaps.

Using this, we ensure that the simulation works with only particles that are inside the domain.

3.6 Weighting charges to nodes

To do a Poisson solve at each time step, we need to calculate the RHS of Equation (2.1), which represents the charge density. In the finite difference approach, we calculate the charge density at a node more accurately by considering the contribution of the particle to each of the nodes.

14 Although we consider charges here, this can be extended to other charge specific quantities (say, like the Electric Field). A typical step in a simulation would have charges distributed throughout the grid. A 2D sample (2D grid with some particles) can be as shown in Figure 3.4.

Figure 3.4. Charges in a sample 2D grid. Arrows show the contribution towards a particular node.

The charge density of the center node in the figure is the sum of all fractional contributions from each of the charges. For a given charge in a cell, we base the contribution off the highlighted “areas” in Figure 3.5. This weighting scheme was taken from [11] and is easier to show in 2D.

Figure 3.5. Area weighted contribution of charge.

Define hx = x/L, hy = y/L. Then the weighting factors for nodes 1, 2, 3, 4 are:

15 w1 = (1 − hx)(1 − hy)

w2 = (hx)(1 − hy)

w3 = (hx)(hy)

w4 = (1 − hx)(hy)

Similarly, although it is harder to show graphically for 3D, the relevant weight factors (with a suitable node numbering) are, with hx = x/L, hy = y/L, hz = z/L:

w1 = (1 − hx)(1 − hy)(1 − hz)

w2 = (hx)(1 − hy)(1 − hz)

w3 = (1 − hx)(hy)(1 − hz)

w4 = (hx)(hy)(1 − hz)

w5 = (1 − hx)(1 − hy)(hz)

w6 = (hx)(1 − hy)(hz)

w7 = (1 − hx)(hy)(hz)

w8 = (hx)(hy)(hz)

The weight factors can be used to account for the contribution of the particle, either for the calculation of its charge or for the re-weighting of the Electric field back onto the charge. The calculated Electric field will be available at the nodes. We use these values to interpolate the Electric field onto the position of the charge, rather than, say, taking the value of the field from the node closest to the particle. Since getting the cell weights is a common operation, we abstract it out as shown in Algorithm 5.

Algorithm 5 Calculate the cell weights given a particle's position.

procedure getCellWeights(pos)
    weights[ ] ← [0]
    indices[ ] ← [0]
    i ← (int)(x/∆L)                                 ▷ ∆L is the cell length
    j ← (int)(y/∆L)
    k ← (int)(z/∆L)
    hx ← (x − i∆L)/∆L
    hy ← (y − j∆L)/∆L
    hz ← (z − k∆L)/∆L
    Fill weights[ ] array
    return weights, indices
end procedure
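A C sketch of this weight computation, using the 3D trilinear factors listed above (the array layout, node ordering and function name are our assumptions):

#include <stddef.h>

/* Compute the eight trilinear weights and flat node indices for the cell
 * containing position (x, y, z); dL is the cell length, N points per side.
 * The node ordering differs from the w1..w8 labels above but covers the
 * same eight corners of the cell. */
void get_cell_weights(double x, double y, double z, double dL, size_t N,
                      double w[8], size_t idx[8])
{
    size_t i = (size_t)(x / dL), j = (size_t)(y / dL), k = (size_t)(z / dL);
    double hx = (x - i * dL) / dL;
    double hy = (y - j * dL) / dL;
    double hz = (z - k * dL) / dL;

    for (int n = 0; n < 8; n++) {
        int di = (n >> 2) & 1, dj = (n >> 1) & 1, dk = n & 1;   /* corner offsets */
        w[n]   = (di ? hx : 1.0 - hx) *
                 (dj ? hy : 1.0 - hy) *
                 (dk ? hz : 1.0 - hz);
        idx[n] = (N * N) * (i + di) + N * (j + dj) + (k + dk);  /* flat node index */
    }
}

The eight weights always sum to one, so the particle's full charge is distributed over the cell's corner nodes.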

Using the above definition, we calculate the RHS with a given list of particles, using Algorithm 6.

Algorithm 6 Update the charge fractions contribution to calculate the charge density.

1: procedure updateChargeFractions(pList, rhsArr)
2:     for p ∈ pList do
3:         weights[ ], indices[ ] ← getCellWeights(p)
4:         for ind ∈ indices[ ] do
5:             ni is the nodal index
6:             rhsArr[ind] += weights[ni] ∗ (p.charge) ∗ multFactor    ▷ multFactor adjusts for cell volume and constants
7:         end for
8:     end for
9: end procedure

3.7 Resort Particles

After a sufficient number of timesteps, the ordering of particles within the array need not correspond to any geometrical distribution. From a cache point of view, however, it would be very beneficial for the particles to be ordered within the array by their axial distance from the capillary. In Algorithm 6, given a particle p in the list, we obtain the weights and indices for the corresponding cell. If the array is ordered by axial distance, then whenever we fetch a particular particle, the other particles in the same cache block are more likely to be cache hits on subsequent accesses. For this purpose, after a set number of Poisson timesteps, the particle array is sorted with the axial distance as the key, using the stdlib qsort function. It is important to note that quicksort-based qsort implementations can degrade to their O(n²) worst case, which a naive pivot choice hits on sorted or nearly-sorted input. So if we call this at every timestep, we might in fact worsen performance. Hence we do the particle sorting only every 20 or 30 iterations. We have not experimentally justified this number but only chose it as a suitable starting point; only with timing routines would we know if this interval degrades performance.
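A sketch of the sort call, assuming the axial direction is the x coordinate and the Particle struct sketched earlier (both assumptions on our part):

#include <stdlib.h>

/* Order particles by axial distance from the capillary (the x coordinate). */
static int cmp_axial(const void *a, const void *b)
{
    const Particle *pa = (const Particle *)a;
    const Particle *pb = (const Particle *)b;
    return (pa->x > pb->x) - (pa->x < pb->x);
}

/* Called only every 20-30 Poisson timesteps, on the live part of the array. */
void resort_particles(Particle *particles, size_t count)
{
    qsort(particles, count, sizeof(Particle), cmp_axial);
}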

3.8 Resetting RHS Vector

After each Poisson timestep, we need to fill the RHS vector again with the charge density contributions. So the vector is reset to zero, but only on the interior points, so that we don't affect the boundary nodes. This operation is straightforward to implement and has the form shown in Algorithm 7.

Algorithm 7 Reset interior nodes in RHS to zero.

procedure resetRHSInteriorPoints(rhsVec)
    for p ∈ interior points of rhsVec do
        p.val ← 0
    end for
end procedure

3.9 Moving Particles

We move the particles in the same manner as [2]. We evaluate the Electric Field from the potential using Equation (2.3) and calculate the force using

F⃗ = qE⃗

where q is the charge of the particle. We weight the Electric field to the particle’s location in a manner outlined in Section 3.6. We move the particle using

mp (v⃗ᵗ⁺¹ − v⃗ᵗ) / dt = F⃗ᵗ

(x⃗ᵗ⁺¹ − x⃗ᵗ) / dt = (1/2)(v⃗ᵗ + v⃗ᵗ⁺¹)

where the superscripts t and t + 1 denote particle quantities at the old and new time levels respectively. The overall steps are outlined in Algorithm 8:

Algorithm 8 Moving Particles in an Electric Field.

procedure moveParticlesInField(pList, EField)
    lostParticles ← 0
    for p ∈ pList do
        weights[ ], indices[ ] ← getCellWeights(p.pos)
        for ind ∈ indices[ ] do
            efᵢ += weights[ind] × EFieldᵢ[ind]      ▷ i refers to the i-th field component
        end for
        Move particle based on the weighted efield
        if isParticleOutsideDomain(p) then
            lostParticles ← lostParticles + 1
        end if
    end for
    return lostParticles
end procedure
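For a single particle, given the field (ex, ey, ez) already interpolated to its position, the half-timestep update above corresponds to the following sketch (names are ours):

/* Advance one particle by dt in the interpolated field (ex, ey, ez),
 * following m_p (v^{t+1} - v^t)/dt = qE^t and
 * (x^{t+1} - x^t)/dt = (v^t + v^{t+1})/2. */
void move_particle(Particle *p, double ex, double ey, double ez, double dt)
{
    double ax = p->charge * ex / p->mass;   /* acceleration components */
    double ay = p->charge * ey / p->mass;
    double az = p->charge * ez / p->mass;

    double vx_new = p->vx + ax * dt;        /* v^{t+1} = v^t + a dt */
    double vy_new = p->vy + ay * dt;
    double vz_new = p->vz + az * dt;

    p->x += 0.5 * (p->vx + vx_new) * dt;    /* average of old and new velocity */
    p->y += 0.5 * (p->vy + vy_new) * dt;
    p->z += 0.5 * (p->vz + vz_new) * dt;

    p->vx = vx_new;  p->vy = vy_new;  p->vz = vz_new;
}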

3.10 Parallelization

To use the particle routines in a parallelized manner, we need more modifications. In most loop based routines, the parallelization is straightforward with OpenMP, but other issues can arise, based on the implementation.

19 3.10.1 Releasing Particles

Although the outer loop can be parallelized with OpenMP, the generation of random numbers in a multi-threaded setup needs a bit more care. The usual call to rand() is not applicable, since the function keeps internal state between invocations. So race conditions can exist here, since multiple threads can overwrite or read an inconsistent state. Although this may not cause a runtime crash, the better approach is to let each thread have its own seed state and call rand_r(). Further, Algorithm 3 and Algorithm 4 are serial in nature, with dependencies on the previous iteration, so we need to parallelize those aspects as well. We do this by letting each thread calculate the index where it would need to insert a new particle. Since we know the number of gaps beforehand, we can use that to figure out whether we need to insert within a gap or at the end of the domain. The resulting modifications to Algorithm 3 give Algorithm 9.
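A sketch of the per-thread seeding with OpenMP (rand_r is the POSIX re-entrant variant of rand; the seeding formula and names here are only illustrative):

#include <omp.h>
#include <stdlib.h>

/* Each thread owns its seed, so rand_r() never shares state across threads. */
void release_particles_parallel(int np, int n_md /* size of the MD sample */)
{
    #pragma omp parallel
    {
        /* Per-thread seed; the exact seeding formula is illustrative. */
        unsigned int seed = 1234u + 17u * (unsigned int)omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < np; i++) {
            int pick = rand_r(&seed) % n_md;  /* which MD particle to clone */
            (void)pick;
            /* ... randomize its attributes and insert it (see Algorithm 9) ... */
        }
    }
}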

Algorithm 9 Release particles into the domain - Parallel

procedure releaseParticles(np, gapCount)
    parfor i ∈ np do
        p ← randomizeParticleAttribs(random pick from MD data)
        val ← (gapCount − i)                        ▷ The index where the thread can check now
        if val ≥ 0 then
            Insert into gap
        else
            Insert at domainCount − (1 + val)
        end if
    end parfor
    Adjust the bounds here as needed
end procedure

3.10.2 Particles leaving the domain

We can parallelize an outer loop with OpenMP, as long as the subsequent operations can occur independently. Again, we let each thread calculate the indices from where the particles are going to be swapped. With the relevant modifications to Algorithm 4, we now have Algorithm 10.

Algorithm 10 Fill up any remaining gaps - Parallel

function swapGapsWithEndParticles(domainParticles, domainCount, lostParticles, lostCount)
    parfor i ∈ len(lostParticles) do
        j ← (lostCount − i)
        domainParticles[lostParticles[i]] ← domainParticles[domainCount − 1 − j]
    end parfor
    Adjust the bounds here as needed
end function

Our focus now is on the parallelization of the charge fraction calculation. Parallelizing the reset of the RHS interior points was straightforward, simply using a parfor, and so it is not explicitly outlined here. Since we use qsort, the resorting is not amenable to parallelization, so we let only one thread do the particle resorting.

3.10.3 Update Charge Fractions

When we try to parallelize the update of the charge fractions, we face a race condition in the update of the charge density belonging to a node. This happens when multiple threads work on particles in the cells that share a particular node (as in Figure 3.4). We decided to handle the charge fraction update using omp atomic. For scalability this may be an issue, as thread-local arrays are usually preferable, but the cost of later merging them into the global array with a single thread might outweigh that benefit. The modified algorithm has the form shown in Algorithm 11.

Algorithm 11 Update the charge fractions contribution to calculate the charge density.

1: procedure updateChargeFractions(pList, rhsArr)
2:     parfor p ∈ pList do
3:         weights[ ], indices[ ] ← getCellWeights(p)
4:         for ind ∈ indices[ ] do
5:             ni is the nodal index
6:             omp atomic
7:             rhsArr[ind] += weights[ni] ∗ (p.charge) ∗ multFactor    ▷ multFactor adjusts for cell volume and constants
8:         end for
9:     end parfor
10: end procedure
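In C with OpenMP, the guarded update corresponds to something like the following sketch, reusing the get_cell_weights sketch from Section 3.6 (names and signatures are ours):

/* Scatter each particle's charge to its 8 surrounding nodes.
 * The atomic guard serializes only the single += on a shared node. */
void update_charge_fractions(const Particle *plist, long n_particles,
                             double *rhs, double dL, size_t N, double mult_factor)
{
    #pragma omp parallel for
    for (long p = 0; p < n_particles; p++) {
        double w[8];
        size_t idx[8];
        get_cell_weights(plist[p].x, plist[p].y, plist[p].z, dL, N, w, idx);

        for (int n = 0; n < 8; n++) {
            #pragma omp atomic
            rhs[idx[n]] += w[n] * plist[p].charge * mult_factor;
        }
    }
}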

3.10.4 Moving Particles

Once the Electric field is known, we can move each particle independently of the other. This was straightforward to implement with OpenMP and so we don’t show the parallelization explicitly here. However, each thread now returns its own version of the number of lost particles. So we have a per-thread array that stores the number of particles lost as seen by each thread. Then we let one single thread do the cumulative count of all lost particles. Using this, we can let all threads update the global array of particles, since the cumulative counts have been pre-calculated.

3.10.5 Performance

We carry out the performance and scaling runs for the particle routines on the LionXG system, whose nodes have two sockets of eight-core 2.6 GHz Intel CPUs. Each core supports one thread. The current scaling investigation was carried out for up to 8 threads, since we did not observe job requests with 16 threads being granted in sufficient time. We solve the Laplace equation and carry out 8000 timesteps, with the results shown in Table 3.1.

Threads             1         2        4        8
Release Particles   0.0689    0.0389   0.0406   0.0391
Swap Gaps           0.0013    0.0032   0.0091   0.0172
Move Particles      10.3219   5.1609   2.6290   1.3598
Total               10.3921   5.2030   2.6787   1.4161

Table 3.1. Scalability study for the particle routines (times in seconds).

Moving the particles takes the majority of the time and scales reasonably. Interestingly, the times for the other parallel steps seem to increase with more threads. This can possibly be explained by false sharing: when multiple threads modify data on the same cache line, they invalidate each other's writes and the operation can effectively become serialized. We may need to carry out a more careful investigation, or profiling with Intel VTune, to verify if this is the cause. The speedup values for the particle moving routine and the total time, as well as the combined graph, are shown in Table 3.2 and Figure 3.6.

Threads   Moving Particles   Total Time
1         1                  1
2         2                  1.99
4         3.92               3.87
8         7.59               7.3386

Table 3.2. Speedup values for the Moving Particles routine and for the Total Time.

Figure 3.6. Laplace Particle Routine Scalability (speedup vs. number of threads for the Move Particles routine and the total time).

3.11 Results-Particle Distribution

Running the code for 300 Poisson timesteps and using 4 threads, we obtained the following distributions:

Figure 3.7. Particle Distribution at Timestep 50.
Figure 3.8. Particle Distribution at Timestep 100.
Figure 3.9. Particle Distribution at Timestep 150.
Figure 3.10. Particle Distribution at Timestep 250.

The front view of the distribution at the end of the run is shown in Figure 3.11.

25 Figure 3.11. Front view of distribution at 250 timesteps

We see that the "beam" of particles is highly collimated along the axial direction. This is expected, since the Electric field has a large magnitude along the axial direction. With different capillary and extractor radii, we can expect the particle distribution to change.

Chapter 4 | Poisson Solve

For the problem at hand, we solve the linear system Ax = b, where A represents the discretization of the PDE, x refers to the solution vector (here, the array of potential values for each node in the grid) and b refers to the RHS in Equation (2.1) (charge density divided by the permittivity) at each node in the grid. As explained in Chapter 2, we use the finite difference method for the grid discretization. The 3D Laplace stencil is used for the interior points, and the Neumann boundary condition is represented using a one-sided second-order approximation. The matrix A roughly has the form

\[
A \;=\;
\begin{bmatrix}
1 & & & & & \\
& \ddots & & & & \\
\cdots\; 1\ 1\ 1 & -6 & 1\ 1\ 1 \;\cdots & & & \\
& & & \ddots & & \\
& & \cdots & -1 & -3 & 4 \\
& & & & & \ddots
\end{bmatrix}
\]

Figure 4.1. Form of Matrix A with coefficients in case of Laplace equation.

where the Dirichlet nodes are represented with a row of [1, 0, . . . , 0] and the Neumann nodes are represented with [. . . , 3, −4, 1, . . . ]. Interior points have the usual 7-point stencil with the coefficient row [. . . , 1, . . . , 1, 1, −6, 1, 1, . . . , 1, . . . ], as shown in Figure 4.2.

27 Figure 4.2. 7-pt stencil with Axes convention.

4.1 Initial Serial Solver

The RHS vector b is mostly zeros, since we are solving the Laplace equation for the initial approximation. As a first iteration towards a solver, the initial implementation was a Successive-Over-Relaxation (SOR) method, due to its ease of implementation. Given A as the matrix shown earlier, we define D, −L and −U as the diagonal, strictly lower-triangular and strictly upper-triangular parts respectively, i.e. A = D − L − U. Then the matrix representation of the SOR form is (as taken from [12])

x⁽ᵏ⁾ = (D − ωL)⁻¹ [ωU + (1 − ω)D] x⁽ᵏ⁻¹⁾ + ω (D − ωL)⁻¹ b

where ω is the extrapolation (relaxation) factor. Typically 0 < ω < 2, and at an optimum value the convergence is accelerated the most.

4.1.1 Matrix-free approach

The matrix A is quite sparse as we can see from Figure 4.1. The corresponding sparsity pattern is as shown in Figure 4.3.

28 Figure 4.3. Sparsity Pattern of matrix A.

So instead of storing the entire matrix, a common approach is to just store the non-zeros using any of the standard Compressed formats. Even with a sparse row format, a lot of space would be wasted storing redundant coefficients. When it comes to the Laplace problem in particular, we see that most of the row coefficients are approximately the same. The bulk of the calculations occur for the interior points which represent the Laplace stencil. Hence we implemented the SOR method (and subsequent solver implementations) using a Matrix-Free approach. Consider evaluating the product Ax as before with b = [0]:

\[
A x \;=\;
\begin{bmatrix}
1 & & & & & \\
& \ddots & & & & \\
\cdots\; 1\ 1\ 1 & -6 & 1\ 1\ 1 \;\cdots & & & \\
& & & \ddots & & \\
& & \cdots & -1 & -3 & 4 \\
& & & & & \ddots
\end{bmatrix}
\begin{bmatrix}
x_1 \\ \vdots \\ x_4 \\ x_5 \\ x_6 \\ x_7 \\ x_8 \\ \vdots
\end{bmatrix}
=
\begin{bmatrix}
0 \\ \vdots \\ 0 \\ 0 \\ 0 \\ 0 \\ 0 \\ \vdots
\end{bmatrix}
\]

Figure 4.4. Evaluating Ax = b.

Since memory access is typically the bottleneck when A is stored explicitly, we can represent the previous matrix-vector operation in a "matrix-free" way as:

x₁ ← 0
for i ∈ [4, 5, 6, 7] do
    xᵢ ← (1/6)(xᵢ₋₃ + xᵢ₋₂ + xᵢ₋₁ + xᵢ₊₁ + xᵢ₊₂ + xᵢ₊₃)
end for
x₇ ← (1/3)(4x₈ − x₆)

By using the above approach, it is clear that the matrix coefficients are implicitly fixed as part of the implementation, which does not have to fetch redundant coefficients from memory. The SOR implementation used this matrix-free method, and one loop of the solver is shown in Algorithm 12.

Algorithm 12 Successive Over Relaxation - Solver Loop

procedure SORSolve(x, b)
    for p ∈ interiorPoints do
        xₚ ← (1/6)(xₚ₋₃ + xₚ₋₂ + xₚ₋₁ + xₚ₊₁ + xₚ₊₂ + xₚ₊₃)
    end for
    for p ∈ neumannPoints do
        Enforce the second-order Neumann boundary condition
    end for
end procedure
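In C, one matrix-free sweep over the interior of the N³ grid looks roughly like the sketch below (a Gauss-Seidel/SOR update with ω = 1; boundary handling is omitted, and whether the h² factor is folded into b depends on how the RHS is scaled, e.g. via the multFactor of Algorithm 6):

#include <stddef.h>

/* One matrix-free sweep (omega = 1) over the interior of an N x N x N grid
 * stored as a flat array with index = N*N*i + N*j + k. For the 7-point
 * Laplace stencil, each interior value becomes the average of its six
 * neighbours minus the scaled right-hand side term. */
void gs_sweep(double *x, const double *b, size_t N, double h)
{
    const size_t s_i = N * N, s_j = N, s_k = 1;   /* strides per direction */

    for (size_t i = 1; i < N - 1; i++)
        for (size_t j = 1; j < N - 1; j++)
            for (size_t k = 1; k < N - 1; k++) {
                size_t c = s_i * i + s_j * j + s_k * k;
                x[c] = ( x[c - s_i] + x[c + s_i]
                       + x[c - s_j] + x[c + s_j]
                       + x[c - s_k] + x[c + s_k]
                       - h * h * b[c] ) / 6.0;
            }
}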

But for solving the Poisson equation, we would need to solve at each timestep. Just for the Laplace equation alone, the SOR solver was seen to take too long, at around 287 seconds for a (101 × 101 × 101) grid. For repeated solves, the SOR solver was too slow and we needed a better solver. For this purpose, we set up the same problem and solved it using PETSc, since it allows us to quickly explore combinations of Solvers and Preconditioners. For this PETSc code, however, we used a matrix-based approach and the A matrix was built in the Compressed Sparse Row (CSR) format. We did this since many of the examples and tutorials were based on building the A matrix. Although PETSc has a matrix-free solver, we felt it was specialized for a different PDE.

4.1.2 Exploration of Solver Options with PETSc

We tried different combinations of Solvers and Preconditioners for a 101 × 101 × 101 grid. With PETSc-3.4.3, we obtained the following times in Table 4.1.

Solver   Preconditioner     Time Taken (sec)
bcgs     SOR                38
bcgs     Hypre              DIVERGED
bcgs     ILU                32
bcgs     Multigrid          56
bcgs     Additive-Schwarz   38
bcgsl    Multigrid          64
gmres    SOR                248

Table 4.1. PETSc-3.4.3 Solver and Preconditioner combinations explored.

With the option of Biconjugate Gradient narrowed down, we tried the next set of options for the Preconditioner with PETSc-3.6.2, since we felt the more up-to-date version might have the latest optimizations and would be a better benchmark to aim for. The times taken for the different cases are outlined in Table 4.2.

Solver   Preconditioner              Time Taken (sec)
bcgs     Multigrid                   23
bcgs     Hypre                       21
bcgs     MG + Hypre (coarse grid)    17

Table 4.2. PETSc-3.6.2 Solver and Preconditioner combinations explored.

After trying out various combinations, we see that the Biconjugate Gradient solver with the Multigrid preconditioner is an effective combination, with a time of 23 seconds. Although the Hypre preconditioner attempts give a lower running time, they rely on Hypre's BoomerAMG preconditioner, which is an Algebraic Multigrid method. Due to its complexity of implementation and the various flavors that the Hypre preconditioners support, we decided to focus on the simpler Geometric Multigrid method. From here on, our aim was to develop an independently written solver with support for multi-threading and use it as a Poisson solver, to be called at each timestep. In terms of ease of implementation and potential for parallelism, we started work on using the Multigrid method as a Solver, with the eventual aim of using it as a preconditioner for the Biconjugate Gradient or Conjugate Gradient (which needs the matrix A to be symmetric) method (refer Chapter 6). The current work still uses Multigrid as a Solver, as described in Chapter 5.

4.2 Results

We carry out the performance investigation on the Ganga system at the Department of Computer Science and Engineering at Penn State. The configuration is as shown in Table 4.3.

CPU                  Intel Xeon E5-2620
Sockets              2
Cores (per socket)   6
Threads per core     2
Logical CPUs         24
Compiler             GCC 4.4.7

Table 4.3. Details of Performance Setup on a Single Node on Ganga Cluster.

We generated a grid with a coarse size of 3 and 8 multigrid levels. This corresponds to 257 grid points on one side. We carry out 500 Poisson timesteps and time the subcomponent functions as well.

32 The timings for the subroutines and the total time as well as the calculated speedups are shown in Table 4.4.

Threads            1            2           4           8           16
Release Particles  0.0188       0.0104      0.0097      0.0103      0.0119
SwapGaps           0.0004       0.0006      0.0011      0.0029      0.0038
Resort             0.0422       0.0369      0.0420      0.0560      0.1447
ResetRHS           9.9935       5.2079      4.4904      4.1384      4.2552
UpdateChargeFrns   1.2772       0.6478      0.4329      0.2637      0.2319
Solve              22360.6440   8211.8040   4222.3612   2191.7874   2173.6223
CalcElecField      86.4144      36.1461     20.4654     16.0485     19.0116
MoveParticles      0.8549       0.3531      0.1991      0.1040      0.0745
Total              22459.2454   8254.2068   4248.0018   2212.4112   2197.3559
Speedup            1            2.7209      5.2870      10.1514     10.2210

Table 4.4. Scaling results for the Poisson Run with 500 Timesteps. Times are in seconds.

To get an overall idea of the scaling, we plot the speedup in Figure 4.5.

Figure 4.5. Graph indicating scaling for the Poisson Run (speedup vs. number of threads).

Interestingly, we notice scaling that is better than expected, which then seems to saturate by 16 threads. The better-than-expected scaling can probably be explained by improved cache locality for the spawned threads if they get mapped to different cores and see less interference from other threads writing to shared cache lines. The overall scaling does not seem to increase beyond 8 threads, and the timings do not change much for any routine between 8 and 16 threads. Since the Solver contributes the largest share of the time here, we describe the Solver implementation and its performance next.

Chapter 5 | Multigrid Method

We used the Multigrid method as the Solver here, due to its ease of implementation and its parallelizable steps (Section 5.5). Multigrid methods are well adapted to elliptic problems such as Equation (2.1), as outlined in [13]. The method has different "cycles", i.e. orders of visiting the fine and coarser grids in the hierarchy, known as V-Cycles, W-Cycles and F-Cycles. We chose to implement the V-Cycle, since its implementation is straightforward. The Multigrid method uses a hierarchy of grids, with the number of "cells" on one side usually halved on each successively coarser level. Broadly speaking, this hierarchy of grids is useful in spreading information across the grid quickly. More rigorous mathematical analysis shows that the Multigrid method obtains its efficiency by estimating the error on different grids and carrying out corrections with this estimate.

5.1 Methodology

The basic steps of a V-Cycle Multigrid are well known [13] and we use a two-level grid approach whose steps are:

1. Pre-Smoother: Carry out a few iterations of a smoother (typically a choice among damped Jacobi, Gauss-Seidel and Chebyshev iterations) to smoothen out the higher frequency errors.

2. Restrict Residual: Calculate the residual on some grid level i and transfer this to the coarser grid. This operation is typically called a Restriction.

35 3. Recurse/Direct Solve: Redo the same procedure on the coarser grid. The low frequency errors in the fine grid now become high frequency errors on the coarse grid and are effectively damped. If the coarse grid is the last level grid, then do a Direct Solve (with the LU method, say).

4. Prolongation: The coarse grid solution is now the correction for the finer level grid. Interpolate these corrections back to the fine grid and correct the fine grid solution. This operation is typically called a Prolongation.

5. Post-smoother: Run the iterations of the smoother to spread the information and damp out high frequency errors again.

We further explain each of the steps in greater detail.

5.1.1 Pre/Post Smoother

The role of the smoother is typically to smoothen out the high frequency errors. We chose to use the Red-Black Gauss Seidel smoother since it is amenable to parallelization. If the Multigrid method is used as a preconditioner for a method like Conjugate Gradient, we need to satisfy additional restrictions on the pre and post-smoother as outlined in [14] to ensure that the Multigrid method is effectively symmetric.
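A sketch of one red-black sweep with OpenMP follows (the colour of a node is taken as the parity of i + j + k, and the scaling of b is as in the earlier Gauss-Seidel sketch; this is our rendering, not the exact ES-PICbench routine):

#include <stddef.h>
#include <omp.h>

/* One red-black Gauss-Seidel sweep: nodes whose (i + j + k) parity equals
 * `colour` are updated in parallel, since same-colour nodes never neighbour
 * each other under the 7-point stencil. Call with colour 0, then colour 1. */
void rb_gs_sweep(double *x, const double *b, size_t N, double h, int colour)
{
    const size_t s_i = N * N, s_j = N;

    #pragma omp parallel for
    for (size_t i = 1; i < N - 1; i++)
        for (size_t j = 1; j < N - 1; j++) {
            /* choose the first k in {1,2} so that (i + j + k) % 2 == colour */
            size_t k0 = 1 + ((i + j + (size_t)colour + 1) % 2);
            for (size_t k = k0; k < N - 1; k += 2) {
                size_t c = s_i * i + s_j * j + k;
                x[c] = ( x[c - s_i] + x[c + s_i]
                       + x[c - s_j] + x[c + s_j]
                       + x[c - 1]   + x[c + 1]
                       - h * h * b[c] ) / 6.0;
            }
        }
}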

5.1.2 Restriction

The residual (denoted by r = b − Ax) is calculated at each of the nodal points. We then "restrict" or transfer these residuals to the coarser grid, based on some form of weighted averaging. We used the Restriction matrix (R) from [15] but implemented it in a matrix-free manner. This is equivalent to carrying out a filter-like operation on all interior points of the grid.

5.1.3 Recurse/Direct Solve

In the two-level grid approach, each level recurses to the coarser level. If the final level is the last-level grid, then the system of equations is solved exactly using a Direct Solve. In the current code, we implemented an LU-based method for solving directly. The matrix A (for the coarsest level) is converted and stored in its LU form, using the same memory as the original matrix. Our method is still matrix-free, since we store the A matrix in its LU form for the coarsest level only. Once the coarse-level solution is known, the next step is to prolongate.

5.1.4 Prolongation

The coarse level solution represents the “error” correction for the finer level grid. These are then interpolated onto the finer grid and we followed the same procedure of representing the Prolongation matrix (P ) from [15] in a matrix free manner. An important condition is that the Prolongation operator is the transpose of the Restriction operator [13], i.e.

P = c (R)ᵀ        (5.1)

This is also one of the conditions to ensure that the Multigrid method remains symmetric if used as a preconditioner. In practice, we observed that convergence is accelerated when this condition is enforced as well. Using the “error” from the coarser grid, we prolongate the error at each point, based on the coarser grid solution and add it to the fine grid solution value.

5.1.5 Matrix-Free Approach

We implemented the multigrid solver in a matrix-free manner as well for the same advantages outlined in Section 4.1.1. All steps of the Multigrid method are amenable to this as well.

1. Pre/Post-smoother: We implement the Gauss-Seidel approach using the 7-pt stencil as shown in Figure 4.2 for the interior nodes. Since the interior stencil coefficients are constant, we represent the matrix-vector operation directly in terms of loops. Both Neumann and Dirichlet conditions are enforced as part of the Smoother.

2. Restriction: The R matrix can be implemented in a matrix free manner. When Neumann conditions are involved, there is normally a need for Restriction on the boundaries. But due to the matrix-free smoother implementation, the residual on the boundaries was enforced to be zero, based on the order of visiting nodes. (Refer Section 5.3.2).

37 3. Recurse/Direct Solve: We construct the matrix for the coarsest level alone. We pass in the number of points (on one side of the cube) as a parameter (usually 3 or 5 points) and construct the coarse matrix appropriately and convert and store in its LU form (done only once).

4. Prolongation: Again we implement this in a matrix-free manner, which required separate checks for whether the fine grid node lies in the interior, on a face, or on an edge of the coarse grid cell.

We prototyped the initial version of the Multigrid solver in MATLAB® and built it up from a simple 1D version to a 3D version, with and without a matrix-based approach. The matrix-free approach was consistently quicker (especially in 3D), so its benefit in terms of performance was clear. There are some caveats to such an approach, though. When the A matrix is explicitly stored (a matrix-based approach), the smoother, residual calculation and coarse-level matrix construction can be specified in terms of matrix operations alone, so any change in A automatically carries through to these operations. With a matrix-free approach, however, the smoother, residual calculation and coarse matrix construction have to be updated individually, since a change in the matrix coefficients means an explicit change in the way these operations are implemented. This was a bit error-prone initially, but we did not see a better way to automatically handle all changes in the matrix. With all the stages of the multigrid method defined, the overall structure of the V-Cycle is shown in Algorithm 13.

Algorithm 13 Steps in the V-Cycle Multigrid Method

1: procedure VCycle(u, b, r, l, LU)
2:     if l < maxLevel then
3:         u[l] ← 0                                 ▷ Reset lower level solution to 0.
4:     end if
5:     if l == 0 then
6:         LUSolve(u[0], b[0], LU)
7:         return u[0]
8:     end if
9:     Pre-Smoother(u[l], b[l])
10:    GetResidual(u[l], b[l], r[l])
11:    Restrict(r[l], r[l − 1])
12:    VCycle(u, b, r, l − 1, LU)
13:    Prolongate(u[l − 1], u[l])
14:    Post-Smoother(u[l], b[l])
15:    return u[l]
16: end procedure
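A condensed C rendering of Algorithm 13 is sketched below. The helper functions correspond to the steps named above; their names and exact signatures are our assumptions, not the ones used in ES-PICbench.

/* Hypothetical helpers corresponding to the steps of Algorithm 13. */
void zero_grid(double *v, int level);                /* set a level's values to 0 */
void lu_solve(double *u0, const double *b0, const double *LU);
void smooth(double *u, const double *b, int level, int sweeps);
void get_residual(const double *u, const double *b, double *r, int level);
void restrict_residual(const double *r_fine, double *rhs_coarse, int fine_level);
void prolongate_and_correct(const double *u_coarse, double *u_fine, int fine_level);

/* One V-cycle over the grid hierarchies u[], b[], r[] (arrays of arrays),
 * recursing from level l down to a direct solve on level 0. */
void vcycle(double **u, double **b, double **r, int l, const double *LU,
            int max_level, int nsmooth)
{
    if (l < max_level)
        zero_grid(u[l], l);          /* reset the coarse-level correction */

    if (l == 0) {                    /* coarsest grid: solve exactly */
        lu_solve(u[0], b[0], LU);
        return;
    }

    smooth(u[l], b[l], l, nsmooth);            /* pre-smoother                    */
    get_residual(u[l], b[l], r[l], l);         /* r = b - A u (matrix-free)       */
    restrict_residual(r[l], b[l - 1], l);      /* restricted residual becomes the */
                                               /* right-hand side one level down  */
    vcycle(u, b, r, l - 1, LU, max_level, nsmooth);

    prolongate_and_correct(u[l - 1], u[l], l); /* coarse solution corrects fine   */
    smooth(u[l], b[l], l, nsmooth);            /* post-smoother                   */
}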

5.2 Implementation

We represent the hierarchy of grids as a double** in the code, i.e. an array of arrays. We generate and store only the coarsest-level A matrix, in its LU form. We store a hierarchy of grids for the quantities u, b, r, corresponding to the solution (x previously), the RHS quantities and the residual respectively. Although it is not necessary to store the hierarchy for r, the residual vector, the alternative would have been to malloc a vector at each level and free it later. Since repeated malloc calls can be costly (for heap memory allocation and consolidation), we preferred to preallocate the residual vectors. So if we instantiate the solver with a coarse grid of 3 points and 3 levels, the layout is as shown in Figure 5.1 (for simplicity, showing a 2D case):

Figure 5.1. Allocation of hierarchy of grids for a 2D face.

The number of points at levels 0, 1, 2 would be (3, 5, 9) respectively. The corresponding V-Cycle is shown in Figure 5.2.
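In general, with n0 points per side on the coarsest grid, level l has

n_l = (n_0 − 1) × 2^l + 1

points per side, which for n0 = 3 gives (3, 5, 9) at levels 0, 1, 2.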

Figure 5.2. Form of the V-Cycle.

5.3 Validation

5.3.1 Dirichlet conditions

We validate the Multigrid solver with Dirichlet conditions by solving the Laplace equation for a specific analytic function. If we choose an analytic function that satisfies the Laplace equation and enforce it on the boundaries, then the same analytic function must be reproduced at all interior points. We can easily verify this by assuming two different solutions Φ1(x, y, z) and Φ2(x, y, z) for the same boundary conditions, i.e.

∇²Φ1 = 0, with Φ1 = f1(x, y, z) on the boundaries
∇²Φ2 = 0, with Φ2 = f1(x, y, z) on the boundaries

Now consider the difference between the two solutions; it satisfies the Laplace equation with zero values on the boundaries:

∇²(Φ1 − Φ2) = 0, with (Φ1 − Φ2) = 0 on the boundaries

The Laplace equation makes each point the average of its neighboring points, so no interior point can exceed the values of its neighbors (the maximum principle). Starting from the nodes closest to the boundary, we can see that the final solution must be zero at all interior points (since all points on the boundaries are zero), i.e.

Φ1(x, y, z) − Φ2(x, y, z) = 0   ⇒   Φ1(x, y, z) = Φ2(x, y, z)

So the solution is unique and must match the analytic function itself. We use the function f1(x, y, z) = x² − 2y² + z² in this case, since it satisfies the Laplace equation. We run the solver until the initial residual is reduced by a factor of 10⁻⁸ (relative tolerance). We then compare the final solution against the analytic one and calculate the error norm. The results are outlined in Table 5.1 and the iterations taken are plotted in Figure 5.3.

Points on one side   Iterations Taken   Error Norm    Time Taken (sec)   Residual Reduction
5                    3                  3.04e-10      9.8e-05            0.0054
9                    7                  7.657e-08     2.57e-04           0.0706
17                   8                  5.273e-07     0.0014             0.1064
33                   8                  3.2042e-06    0.0102             0.1100
65                   8                  1.145e-05     0.1897             0.1090
129                  8                  3.4980e-05    0.7083             0.1060
257                  8                  1.0100e-04    5.4074             0.1018
Table 5.1. Validation of Multigrid Solver with Dirichlet conditions.
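For concreteness, a minimal sketch of this validation check is shown below: the converged solution is compared point-by-point against the analytic function f1 and the 2-norm of the difference is reported. The grid layout (index i mapped to x, j to y, k to z with spacing h) and whether the thesis reports a raw or scaled norm are assumptions here.

#include <math.h>
#include <stddef.h>

/* Analytic test function: satisfies the Laplace equation. */
static double f1(double x, double y, double z)
{
    return x * x - 2.0 * y * y + z * z;
}

/* 2-norm of the difference between the computed solution u (n points per
 * side, spacing h, flat storage) and the analytic function. */
static double error_norm(const double *u, int n, double h)
{
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            for (int k = 0; k < n; k++) {
                double diff = u[((size_t)i * n + j) * n + k]
                            - f1(i * h, j * h, k * h);
                sum += diff * diff;
            }
    return sqrt(sum);
}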

Figure 5.3. Iterations taken for varying problem size (x-axis: number of points on one side of the domain; y-axis: iterations taken for convergence).

The plot exhibits the desired features of a Multigrid method. It is well known that the Multigrid method achieves convergence in the same number of iterations (for a given relative tolerance) regardless of problem size. The plot demonstrates this, and the Residual Reduction Ratio stays close to 0.1. This is also consistent with the roughly 8 iterations needed to reduce the residual by a factor of 10⁻⁸, since 0.1⁸ = 10⁻⁸.

5.3.2 Mixed Neumann

In order to simulate the actual problem, we need to incorporate the Neumann boundary conditions. Since the problem has at least a few nodes with Dirichlet conditions (the Capillary and Extractor, from Figure 2.1), there is no null space for the solver to account for [16]. To validate the solver in this case, we used a problem setup of a 3D cube with the X = 0 face at the capillary voltage (0 V), the X = END face at the Extractor voltage (−1350 V), and homogeneous Neumann conditions on all other faces. First-order Neumann boundary conditions were used for ease of implementation. The analytic solution in this case is a voltage that varies linearly from the X = 0 face to the X = END face. We used the same residual tolerance as earlier and calculated the error norm with respect to the analytical solution. The results are shown in Table 5.2 and Figure 5.4.

Points on one side   Iterations Taken   Error Norm   Time Taken (sec)   Residual Reduction
5                    17                 4.207e-05    2.748e-04          0.3326
9                    17                 2.267e-04    6.3e-04            0.3475
17                   17                 0.002083     0.002988           0.3796
33                   17                 0.0293295    0.020375           0.4330
65                   18                 0.178236     0.4610             0.4914
129                  19                 0.9275       2.0724             0.5414
257                  19                 6.8018       14.675             0.5806
Table 5.2. Validation of the Multigrid solver with Neumann Boundary Conditions.

Figure 5.4. Iterations taken for varying problem size (x-axis: number of points on one side of the domain; y-axis: iterations taken for convergence).

We see that the error norm is noticeably larger, and so is the Residual Reduction. It is possible that the larger error norm is due to the first-order approximation, although by manual inspection of grid point values we consistently observed very close agreement with the analytical solution. The residual reduction ratio has also increased, but we still see nearly constant convergence behavior. The increase in residual reduction is possibly a known feature of the Multigrid method when Neumann boundary conditions are introduced; extra effort must be made to maintain convergence with Neumann boundary conditions, and according to [17], seemingly straightforward attempts to extend a Dirichlet-based code to Neumann conditions do not perform as well. While developing the Multigrid code (using MATLAB), we sometimes also observed poor convergence with Neumann conditions when using a matrix-based approach. This was because the residuals were non-zero on the boundaries of the finer grid and these quantities were not restricted to the coarser grid. By applying a 2D Restriction operator on the boundary faces, we obtained much better convergence. In the matrix-free version, however, by choosing the order of visiting the nodes, we can guarantee that the residuals stay zero at the boundary nodes. In this case, once an interior point value is updated, we copy the value to an adjacent Neumann node if it exists. By using this approach, we ensure that residuals on the boundaries stay at zero and no separate restriction operation is necessary there. A sketch of this boundary treatment is shown below.
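A minimal sketch of the boundary treatment, for one face only. In the actual code the copy happens node-by-node inside the sweep; here it is written as a separate pass over the Y = 0 face for brevity, and the indexing macro and array name are illustrative.

#include <stddef.h>
#define IDX(i, j, k, n) (((size_t)(i) * (n) + (j)) * (n) + (k))

/* First-order homogeneous Neumann condition on the Y = 0 face:
 * copying the adjacent interior value to the boundary node enforces
 * dPhi/dn = 0 and keeps the boundary residual at zero. */
static void copy_to_neumann_y0(double *u, int n)
{
    for (int i = 0; i < n; i++)
        for (int k = 0; k < n; k++)
            u[IDX(i, 0, k, n)] = u[IDX(i, 1, k, n)];
}

The other Neumann faces follow the same pattern.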

5.3.3 Small domain length issue

For the initial Laplace solve, we observed that beyond a certain point the residuals never decreased and the Residual Reduction Ratio stayed at 1. This was tracked down to the domain spacing h being so small that residual calculations of the form rᵢ = bᵢ − (uᵢ₋₁ + uᵢ₊₁ + ...)/h² produced spurious values due to the 1/h² term. A current workaround for this issue is to iterate only up to a MAX_ITER value, at which point the solution appears converged. Subsequent Poisson iterations did not trigger the MAX_ITER condition, so we encounter this situation only once.
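A sketch of the resulting outer loop is shown below: V-cycles are applied until the relative tolerance is met or MAX_ITER cycles have run. The helper names vcycle and residual_norm, the MAX_ITER value, and the argument lists are hypothetical.

#define MAX_ITER 50          /* illustrative cap; the actual value may differ */
#define REL_TOL  1e-8

/* Hypothetical helpers standing in for the V-cycle and residual routines. */
void   vcycle(double **u, double **b, double **r, int level, void *lu);
double residual_norm(const double *u, const double *b, int level);

void mg_solve(double **u, double **b, double **r, int levels, void *lu)
{
    int top = levels - 1;
    double r0 = residual_norm(u[top], b[top], top);
    for (int it = 0; it < MAX_ITER; it++) {
        vcycle(u, b, r, top, lu);
        if (residual_norm(u[top], b[top], top) <= REL_TOL * r0)
            break;           /* converged to the relative tolerance */
    }
}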

5.3.4 Troubleshooting Multigrid

Debugging Multigrid methods is particularly challenging, as the fine-grid error correction can mask issues and produce correct results even when there are bugs [17]. One of the first clues that our initial multigrid implementation was incorrect was the non-constant convergence, i.e., an increasing number of iterations with respect to problem size. Multigrid properties are well studied, so the lack of constant convergence is an important marker. In general, we noticed that the bug-prone areas tend to be the Smoother, the Residual Calculation, and the coarse matrix construction, partly due to the matrix-free nature of the implementation. The Restriction and Prolongation in particular are very important, and convergence improved significantly when Equation (5.1) was imposed. The Residual Reduction Ratio is also an important clue; typical values range from about 0.09–0.1 up to roughly 0.6, and much larger residual reduction values indicate a misbehaving step of the process. We found that the most useful way to develop the code was to build it up from a one-dimensional case and write test cases for the Smoother, Restriction, and Prolongation steps to tease apart the effect of each component as we add layers of complexity. Since the Multigrid method is very robust, it can still converge with an incorrect implementation.

5.4 Comparison with PETSc

We ran the matrix-free version of the code on a single thread on a problem setup similar to Section 5.3.2 and compared it against PETSc 3.6.2 using the options -ksp_type richardson (to use multigrid as a matrix-based solver), -pc_type mg, and -da_refine 6 to use 6 levels of multigrid, which corresponds to 65 nodes on one side. The standalone code took 0.46 seconds, whereas the PETSc code took around 10 seconds. We attribute this difference to the matrix-free approach in our code versus the matrix-based approach in PETSc, so this is not an entirely fair comparison. Although PETSc has a matrix-free example, its governing equation is different and we did not have time to adapt it to the current problem. With increasing multigrid levels, the difference in time between our code and PETSc became more prominent, which is expected. We feel that this comparison validates the advantages of a matrix-free implementation tuned to the problem of interest.

5.5 Parallelization

Almost all of the steps of the Multigrid method are parallelizable. Here we outline the steps taken to parallelize the various subcomponents of the method.

5.5.1 Smoother

The Red-Black Gauss-Seidel smoother is typically a good choice for parallel smoothing, with the coloring on alternate cells as shown in Figure 5.5.

Figure 5.5. Red-Black coloring.

The main barrier to parallelization in a typical Gauss-Seidel smoothing scheme is the dependency between neighboring points: the previous value has to be evaluated and updated before moving on to the next one. In a Red-Black scheme, either a Red or a Black sweep (where a sweep is an iteration over the nodes of that particular color) is carried out first. Since a nodal point has no neighbors of the same color, we avoid this dependency, which facilitates straightforward parallelization. A sweep over the nodes of the other color is then carried out, as shown in Algorithm 14.

Algorithm 14 Red-Black Gauss-Seidel Outline.
procedure RedBlackGS(x, b)
    parfor p ∈ RedNodes do                 ▷ Red sweep
        p.val = Smoothen(p, neighbors(p))
    end parfor
    Synch()                                ▷ Synchronize threads here
    parfor p ∈ BlackNodes do               ▷ Black sweep
        p.val = Smoothen(p, neighbors(p))
    end parfor
end procedure
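A minimal OpenMP sketch of this scheme is shown below, coloring node (i, j, k) red when i + j + k is even and black when it is odd, so that all updates inside one parallel loop are independent. As before, the array names, grid size n, spacing h, and stencil sign convention are assumptions rather than the actual ES-PICbench code.

#include <stddef.h>
#define IDX(i, j, k, n) (((size_t)(i) * (n) + (j)) * (n) + (k))

/* One Red-Black Gauss-Seidel sweep: color 0 (red) first, then color 1 (black).
 * The implicit barrier at the end of each parallel-for separates the sweeps. */
static void red_black_sweep(double *u, const double *b, int n, double h)
{
    const double h2 = h * h;
    for (int color = 0; color < 2; color++) {
        #pragma omp parallel for collapse(2) schedule(static)
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++) {
                int k0 = 1 + ((i + j + 1 + color) % 2);   /* first k of this color */
                for (int k = k0; k < n - 1; k += 2) {
                    double nbrs = u[IDX(i - 1, j, k, n)] + u[IDX(i + 1, j, k, n)]
                                + u[IDX(i, j - 1, k, n)] + u[IDX(i, j + 1, k, n)]
                                + u[IDX(i, j, k - 1, n)] + u[IDX(i, j, k + 1, n)];
                    u[IDX(i, j, k, n)] = (nbrs - h2 * b[IDX(i, j, k, n)]) / 6.0;
                }
            }
    }
}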

5.5.2 Direct Solve

We use the LU method for the direct solve, and this step is not parallelized, since the forward and backward substitutions need the values of previously computed variables. We use the direct solve only on the coarsest level, so the constructed matrix is not very large and the time taken for a single direct solve is very small as well. We therefore let a single thread carry out the direct solve.

5.5.3 Remaining steps

The remaining steps, such as the calculation of the residual, Restriction, and Prolongation, are all straightforwardly parallelizable with OpenMP. The Restriction and Prolongation in particular are like filter operations on an image and can be expected to scale well. Since the parallelization is straightforward, we do not explicitly outline it here; a brief sketch of a parallel restriction pass is shown below.
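As an example of how simply these passes parallelize, the sketch below shows a restriction pass in which each coarse node is computed independently. Plain injection is used here for brevity, whereas the actual operator uses the weighted restriction discussed earlier; the names and layout are illustrative.

#include <stddef.h>
#define IDX(i, j, k, n) (((size_t)(i) * (n) + (j)) * (n) + (k))

/* Restrict a fine grid (nf = 2*nc - 1 points per side) to a coarse grid with
 * nc points per side.  Every coarse node is independent, so a parallel-for
 * over the coarse nodes is sufficient. */
static void restrict_injection(const double *fine, double *coarse, int nc)
{
    int nf = 2 * nc - 1;
    #pragma omp parallel for collapse(2) schedule(static)
    for (int I = 0; I < nc; I++)
        for (int J = 0; J < nc; J++)
            for (int K = 0; K < nc; K++)
                coarse[IDX(I, J, K, nc)] = fine[IDX(2 * I, 2 * J, 2 * K, nf)];
}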

5.5.4 Results

We show the total timing runs for a Multigrid Dirichlet problem similar to Section 5.3.1 in Table 5.3. We used 9 multigrid levels and a coarse grid of 3 points on each side, which means we have (3 − 1) × 2⁸ + 1 = 513 nodal points on one side at the finest level. We chose a higher number of levels to encourage more work per thread, and varied the thread count from 1 to 8. Although 16 threads can be supported on the LionXG cluster (2 sockets, 8 cores per socket, 1 thread per core), we found that such job requests were never granted, so we present results for up to 8 threads (within one socket). The corresponding speedup plot is shown in Figure 5.6.

Number of Threads   Time Taken (sec)   Speedup
1                   44.9766            1
2                   24.3285            1.8487
4                   16.4777            2.7295
8                   18.5025            2.4308
Table 5.3. Multigrid Dirichlet with 9 levels and 3 points on coarsest level.

Figure 5.6. Multigrid Dirichlet (9 levels, 3 coarse points) speedup graph (x-axis: number of threads; y-axis: speedup for MG Dirichlet setup).

We see that the scaling performance is not linear. Since the Multigrid method consists of many substeps, we need to look at the timings of each component and check its scaling behavior as well. The timing results for this case are investigated next.

5.5.5 SubComponent Timings

We further added timing routines (reporting seconds) to each of the steps. The variation of the time taken by each component of the runs in Table 5.3 with the number of threads is shown in Table 5.4.

Threads          1         2        4        8
Pre-Smoother     12.3647   6.7035   5.2014   6.8667
CalcResidual     2.7084    1.5016   1.0157   1.1127
Restriction      2.3214    1.1772   0.5994   0.3203
Recurse          5.6219    3.1719   2.0034   1.4651
Prolongation     7.2641    3.7406   1.8728   0.9530
Post-Smoother    12.3879   6.8248   5.0875   6.8901
CalcResidual     2.3078    1.2080   0.6960   0.8903
Table 5.4. Timings for the individual subcomponent steps in the Multigrid Dirichlet case.

From a preliminary analysis, we see that the Smoother does not scale as well as we would expect, and neither does the residual calculation. Since the residual calculation contributes a smaller percentage of the overall time, we do not currently consider its scaling.

5.5.5.1 Smoother

The smoother operations take the most time and so have the biggest impact on overall scalability. The pre- and post-smoother exhibit the same behavior, and the speedup results (calculated from Table 5.4) are shown in Table 5.5 and plotted in Figure 5.7.

Number of Threads   Speedup
1                   1
2                   1.8445
4                   2.3772
8                   1.8007
Table 5.5. Scalability study of Smoother from the Subcomponent Timings.

Figure 5.7. Scalability plot of Smoother from the Subcomponent Timings (x-axis: number of threads; y-axis: speedup of Smoother).

The speedup of the smoother closely follows that of the total time, so the smoother performance is not optimal and needs further investigation to understand why. The scaling of the Poisson simulation is thus limited by the scaling of the Solver.

5.5.5.2 Restriction and Prolongation

The Restriction and Prolongation are expected to scale well since they are similar to filter operations. The speedup values for each of the operations are calculated from Table 5.4 and outlined in Table 5.6 and plotted in Figure 5.8.

Number of Threads   Restriction   Prolongation
1                   1             1
2                   1.9719        1.9420
4                   3.8729        3.8788
8                   7.2477        7.6227
Table 5.6. Tabulation of Restriction and Prolongation Scalability.

Figure 5.8. Plot of Restriction and Prolongation Scalability (left panel: Restriction speedup; right panel: Prolongation speedup; x-axis: number of threads).

5.5.6 Smoother Performance

From Section 5.5.5.1, it is clear that we need to analyze the smoother further. We wanted to investigate whether cache performance was the issue, and used Cachegrind [18] to instrument the run. To interpret the profile results, we need to be aware of the memory hierarchy of the LionXG cluster, which is shown in Table 5.7.

L1D cache   32K
L1I cache   32K
L2 cache    256K
L3 cache    20480K
Table 5.7. Details of the Memory Hierarchy for LionXG.

If 8 multigrid levels are allocated with 3 coarse grid points per side, then the finest level has (2⁸ + 1) = 257 nodal points on one side. A 3D array of this size (each entry a double) corresponds to roughly 129 MB, which far exceeds the 20 MB last-level cache shown above. It is therefore possible that performance degrades due to cache fetches. To verify this, we ran Cachegrind on a sample problem that used only the Red-Black Gauss-Seidel smoother to solve the Dirichlet problem, similar to Section 5.3.1. Although the run exceeded the job length limit of around 4 hours, we still obtained cache statistics, which are highlighted in Table 5.8.
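As a quick back-of-the-envelope check (interpreting the 129 MB figure above in binary units):

257³ × 8 bytes ≈ 1.36 × 10⁸ bytes ≈ 129 MiB,

which is more than six times the 20 MiB (20480K) L3 cache.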

Quantity        Cumulative %   Component 1            Component 2
D1 miss rate    11.2           12.3% (read misses)    0% (write misses)
LLd miss rate   5.6            6.1% (read misses)     0% (write misses)
LL miss rate    1.5            1.5% (read misses)     0% (write misses)
Mispred rate    1.1            1.1% (cond branches)   4.1% (indirect)
Table 5.8. Cachegrind Performance Results.

From the above table, it is clear that cache performance is not the issue in the current setup. In order to obtain timing statistics, we set up another smoother scalability test with 7 multigrid levels (corresponding to 129 points per side), since the test takes too long for a grid with 8 levels (257 grid points). This ran to completion, and the timings are shown in Table 5.9 and plotted in Figure 5.9.

Number of Threads   Time Taken (sec)   Speedup
1                   129.2950           1.0000
2                   60.1077            2.1511
4                   29.4238            4.3942
8                   15.5470            8.3164
Table 5.9. Standalone Smoother Scalability Setup with 7 Multigrid levels and 3 coarse grid points.

Figure 5.9. Smoother Scalability with 7 Multigrid levels (x-axis: number of threads; y-axis: speedup of standalone smoother).

We can see that the speedup is good up to 8 threads, so we still do not know why the smoother performance degrades when it is used as part of the Multigrid method. It is possible that factors such as a busy node play a role during a timing run, but multiple timing runs of the Smoother within Multigrid reproduced the degraded performance. We need to investigate further why the smoother does not perform as well when used as part of the Multigrid method.

5.6 Code Results

So far we have only discussed the implementation and performance details. In this section, we present the actual outputs from the code. The output files are written in VTK format and post-processed using VisIt [19]; a sketch of the output routine is given below.
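A minimal sketch of such an output routine is shown below, writing the nodal potential as a legacy-format VTK structured-points file that VisIt can open. The file path, the field name "potential", and the assumption that the solution array is stored with x varying fastest (VTK's ordering) are illustrative rather than the actual ES-PICbench output code.

#include <stdio.h>
#include <stddef.h>

/* Write an n x n x n nodal field with spacing h to a legacy VTK file. */
static int write_vtk(const char *path, const double *u, int n, double h)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    size_t total = (size_t)n * n * n;
    fprintf(f, "# vtk DataFile Version 3.0\npotential\nASCII\n");
    fprintf(f, "DATASET STRUCTURED_POINTS\n");
    fprintf(f, "DIMENSIONS %d %d %d\n", n, n, n);
    fprintf(f, "ORIGIN 0 0 0\nSPACING %g %g %g\n", h, h, h);
    fprintf(f, "POINT_DATA %zu\nSCALARS potential double 1\nLOOKUP_TABLE default\n",
            total);
    for (size_t idx = 0; idx < total; idx++)
        fprintf(f, "%g\n", u[idx]);   /* values in VTK's x-fastest order */
    fclose(f);
    return 0;
}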

5.6.1 Potential Distribution

The potential distribution and electric field are shown in Figures 5.10 and 5.11.

Figure 5.10. Potential Distribution at Timestep 50.
Figure 5.11. Electric Field at Timestep 50.

We can clearly see in Figure 5.11 that the electric field lines form “streamlines” that start from the capillary and move towards the Extractor, qualitatively indicating that the field is correct. However, the vectors in the plot are not scaled by magnitude; if they were, the only noticeable vectors would be near the capillary and the rest would be negligible.

5.6.2 Effect of Particles on Potential

The potential contours at timesteps 50 and 250 are shown:

Figure 5.12. Potential Contours at Timestep 50.
Figure 5.13. Potential Contours at Timestep 250.

Although it is not clearly visible in the images, the contours shift slightly to the right with a minor “bump”, which indicates the effect of the charges. By solving the Poisson equation on the grid, the charge-to-charge interaction is also captured, and this causes a slight expansion of the potential region.

Chapter 6 | Conclusions and Future Work

We developed a standalone C code, ES-PICbench, for solving a Particle-In-Cell based problem simulating a colloid thruster. We parallelized the code with OpenMP and implemented a Multigrid solver as well as routines to move particles in an electric field. Our standalone parallel code performs up to 540× faster than the previous solution based on Deal.II. We also demonstrate that our simpler implementation can outperform heavier-duty versions built on libraries, which may have their own dependencies and constraints on parallelization. The scaling performance of our code was limited significantly by the scaling of the Multigrid solver, possibly due to the scaling behavior of the Red-Black Gauss-Seidel smoother. The code for ES-PICbench is available online and hosted at [20] for evaluation and usage.

6.1 Future work

There are several aspects to explore in future work. First, the Multigrid method is often used as a preconditioner for a solver such as Conjugate Gradient, which is known to perform very well for symmetric, positive definite systems [14]. This in turn may require changes in the smoother ordering, to ensure a correctly reversed order of visiting nodes so that the preconditioner remains symmetric and positive definite. If we cannot preserve the symmetric nature, we could instead use the Biconjugate Gradient method as the solver, since that was one of the options in Section 4.1.2. We would also need to analyze the scaling issues with the Solver and resolve them by identifying bottlenecks; currently, the reason for the non-optimal scaling of the Solver is not clear. In a related issue, it is also preferable to avoid atomics and barriers, so thread-local storage arrays should probably be used in Algorithm 11.

Bibliography

[1] Provost, S., C. Theroude, and B. Pezet (2004) “Insight into EADS Astrium Modeling Packages for Electric Propulsion/Spacecraft Interactions,” in Proc. 4th International Spacecraft Propulsion Conference, pp. 1–6.

[2] Wang, P., A. Borner, B. Korkut, Z. Li, and D. Levin (2013) “Simulations of Electrospray in a Colloid Thruster with High Resolution Particle-in-Cell Method,” in Proc. 44th AIAA Plasmadynamics and Lasers Conference, pp. 1–37.

[3] Balay, S., S. Abhyankar, M. F. Adams, J. Brown, P. Brune, K. Buschelman, L. Dalcin, V. Eijkhout, W. D. Gropp, D. Kaushik, M. G. Knepley, L. C. McInnes, K. Rupp, B. F. Smith, S. Zampini, and H. Zhang (2015) “PETSc Web page,” http://www.mcs.anl.gov/petsc.

[4] ——— (2015) PETSc Users Manual, Tech. Rep. ANL-95/11 - Revision 3.6, Argonne National Laboratory. URL http://www.mcs.anl.gov/petsc

[5] Balay, S., W. D. Gropp, L. C. McInnes, and B. F. Smith (1997) “Efficient Management of Parallelism in Object Oriented Numerical Software Libraries,” in Modern Software Tools in Scientific Computing (E. Arge, A. M. Bruaset, and H. P. Langtangen, eds.), Birkhäuser Press, pp. 163–202.

[6] Korkut, B., P. Wang, Z. Li, and D. Levin (2013) “Three Dimensional Simulation of Ion Thruster Plumes with AMR and Parallelization Strategies,” in Proc. 49th AIAA/ASME/SAE/ASEE Joint Propulsion Conference, pp. 1–23.

[7] Bangerth, W., R. Hartmann, and G. Kanschat (2007) “deal.II – a General Purpose Object Oriented Finite Element Library,” ACM Trans. Math. Softw., 33(4), pp. 24/1–24/27.

[8] Bangerth, W., T. Heister, L. Heltai, G. Kanschat, M. Kronbichler, M. Maier, B. Turcksin, and T. D. Young (2013) “The deal.II Library, Version 8.1,” arXiv preprint 1312.2266v4.

[9] Borner, A. and D. Levin (2013) “Coupled Molecular Dynamics - Three Dimensional Poisson Simulations of Ionic Liquid Electrospray Thrusters,” in Proc. 33rd International Electric Propulsion Conference, pp. 1–17.

[10] Khokhlov, A. M. (1998) “Fully Threaded Tree Algorithms for Adaptive Refinement Fluid Dynamics Simulations,” Journal of Computational Physics, 143(2), pp. 519–543.

[11] “Particle In Cell,” https://mpnl.seas.gwu.edu/index.php/research/modeling-and-simulation/9-particle-in-cell, last accessed March 2016.

[12] “Successive Overrelaxation Method – Wolfram MathWorld,” http://mathworld.wolfram.com/SuccessiveOverrelaxationMethod.html, last accessed March 2016.

[13] Briggs, W. L., “A Multigrid Tutorial,” slides at https://www.math.ust.hk/~mawang/teaching/math532/mgtut.pdf, last accessed March 2016.

[14] McAdams, A., E. Sifakis, and J. Teran (2010) “A parallel multigrid Poisson solver for fluids simulation on large grids,” in Proc. ACM SIGGRAPH Symposium on Computer Animation, pp. 1–10.

[15] Demmel, J., D. Vasileska, and M. Saraniti, “Multigrid Overview,” slides at https://nanohub.org/resources/9130/download/Multigrid_method.pdf, last accessed March 2016.

[16] “Solving PDEs – Firedrake documentation,” http://www.firedrakeproject.org/solving-interface.html#id17, last accessed March 2016.

[17] Brandt, A. and O. E. Livne (2011) Multigrid Techniques, 1984 Guide with Applications to Fluid Dynamics - Revised Edition, SIAM.

[18] Nethercote, N. and J. Seward (2007) “Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation,” in Proc. ACM SIGPLAN Conference on Programming Language Design and Implementation, pp. 1–12.

[19] Childs, H., E. Brugger, B. Whitlock, J. Meredith, S. Ahern, D. Pugmire, K. Biagas, M. Miller, C. Harrison, G. H. Weber, H. Krishnan, T. Fogal, A. Sanderson, C. Garth, E. W. Bethel, D. Camp, O. Rübel, M. Durant, J. M. Favre, and P. Navrátil (2012) “VisIt: An End-User Tool For Visualizing and Analyzing Very Large Data,” in High Performance Visualization–Enabling Extreme-Scale Scientific Insight, CRC Press, pp. 357–372.

[20] Narayanan, R. K., “Compressed Archive of ES-PICbench code,” https://psu.box.com/s/9821pzzg31f1yptn4g7b5lwpms8itngi, last accessed March 2016.
