Spatial Data Structures, Sorting and GPU Parallelism for Situated-agent Simulation and Visualisation

A.V. Husselmann and K.A. Hawick
Computer Science, Institute for Information and Mathematical Sciences,
Massey University, North Shore 102-904, Auckland, New Zealand
email: { a.v.husselmann, k.a.hawick }@massey.ac.nz
Tel: +64 9 414 0800  Fax: +64 9 441 8181
March 2012

ABSTRACT

Spatial data partitioning techniques are important for obtaining fast and efficient simulations of N-Body particle and spatial agent based models, where they considerably reduce redundant entity interaction computation times. Highly parallel techniques based on concurrent threading can be deployed to further speed up such simulations. We study the use of GPU accelerators and highly data parallel techniques, which require more complex organisation of spatial datastructures and also sorting techniques to make best use of GPU capabilities. We report on a multiple-GPU (mGPU) solution to grid-boxing for accelerating interaction-based models. Our system is able to both simulate and also graphically render in excess of 10^5 to 10^6 agents on desktop hardware in interactive time.

Figure 1: A visualisation of a uniform grid datastructure as generated by GPU, using NVidia's CUDA.

KEY WORDS
grid-boxing; sorting; GPU; thread concurrency; data parallelism

1 Introduction

Simulation problems involving interacting spatially located entities, such as N-Body particle models [1-4] or spatial agent-based models, often require considerable computational power to support simulation of adequately large scale systems. Large size is important to expose and investigate complex and emergent phenomena that often only appear on logarithmic length and system size scales. To reduce the O(N^2) computational complexity of such interacting systems, spatial partitioning methods are often used. Depending upon how effective the scheme used is, the computational complexity can be reduced down to O(N log N). Furthermore, a good scheme will also support, and not impede, the introduction of parallelism into the computation so that an appropriate parallel computer architecture [5] can be used to further reduce compute time per simulation step. As a vehicle, we use the Boids model originally by Reynolds [6, 7].

Spatial partitioning [8-10] has been employed as far back as the earliest attempts of Andrew Appel [11], who produced the first multipole partitioning method, with many subsequent developments [12]. The N-Body particle simulation has served as a simple but effective testing platform for improving the performance of particle simulations and, more generally, interaction-based models which rely on particle-particle interactions [13]. The vast majority of particle simulations suffer from an inherent performance problem and a severe lack of system size scaling. This is mostly due to every particle interacting with every other particle, inducing an O(N^2) complexity. Even for a conceptually simple interaction such as a nearest neighbour (NN) interaction, it is still necessary, in the simplest cases, to perform calculations (such as Euclidean distance) for every combination of two particles. Collision detection is one such problem.

Figure 2: Visualisation of the same set of agents with and without a uniform grid. (a) Screenshot of our Boids simulation. (b) Screenshot of the same simulation state with visualisation of the grid-box algorithm superposed.

Figure 3: Examples of datastructures used in other implementations for the same purpose. (a) A visual example of a K-d tree datastructure used extensively in CPU implementations for spatial partitioning purposes. (b) K-d tree datastructure, with particles clustered so densely that the datastructure loses its effectiveness.

This problem is mitigated by using methods such as Barnes-Hut treecodes [14, 15], Fast Multipole methods [16], and tree organisational methods [17, 18] including Octrees [19-26], K-d trees [27, 28], Kautz trees [29], Adaptive Refinement Trees [30], and more, all of which fall under the category of spatial partitioning. Not all of these algorithms would necessarily be suitable for any one particle simulation. Some are suited to simulations which frequently have high density in one or several spatial locations, some perform best with uniformly distributed particles, and some attempt to systematically treat several particles as a single point mass to reduce computation in exchange for losing a small amount of precision. This last technique is often well-suited to applications in astrophysics. There also exist algorithms which sacrifice precision for performance, with techniques as simple as updating only a certain number of agents per simulation time step; this dramatically increases performance, but the loss of precision is so great that the result no longer reflects the original simulation code. Choosing which of these algorithms or methods to use in a simulation therefore ultimately depends on the purpose of the simulation.

This work contributes towards efficiency in the simulation of large multi-agent systems which do not require high precision in very small clusters. To accomplish such a task in a single-threaded environment, one would normally employ standard techniques such as Octrees or K-d trees, which are relatively simple to implement in single-threaded code. K-d trees offer some parallelisation opportunities, with some authors reporting piece-wise GPU datastructure construction algorithms of O(N log^2 N) and also O(N log N) complexity [27, 28]. The former is normally the case when constructing a K-d tree by performing an O(N log N) sort for every node in the tree to obtain the median. The latter is the case when using a linear median-finding algorithm, which is neither trivial to implement on GPU nor well-suited to it.

One space partitioning algorithm responds very well to parallelism: grid-boxing [8] (sometimes known as uniform grid space partitioning). The NVIDIA CUDA SDK comes with a particle simulator which makes use of this technique [31]. Our single-GPU implementation of grid-boxing is heavily based on this.

Graphical Processing Units (GPUs) [32] lend themselves particularly well to accelerating simulations, especially since the advent of NVidia's CUDA. CUDA is a powerful addition to C++ which allows the programmer to write code in C or C++ containing sections of special syntax, which NVidia's compiler in the CUDA SDK compiles into GPU instructions before handing the rest of the program code to the system's built-in C or C++ compiler.

Figures 3(a) and 3(b) show examples of K-d trees, which are very popular for accelerating programs [27, 28]. Since many aspects of ray tracing are inherently parallel, K-d trees have also received a large amount of research effort in order to offload as much computation as possible from the CPU to the GPU. For this reason, GPU K-d trees are of great interest, and numerous implementations already exist [33-35]. Each of these implementations has a different ratio of CPU code to GPU code. Significant parallelism can be achieved by building the tree level-by-level instead of node-by-node, but even with this advantage, it is difficult to retain most computation on the GPU.

Morton ordering (also known as Z-ordering or N-ordering) is a very popular space-filling curve for the purpose of ensuring spatial locality [36]. Treecodes on GPU make use of Morton codes, which serve the purpose of encoding tree nodes for storing in hashtables [20, 34]. We use this method to increase coalesced global memory reads. The CUDA architecture generally reads fixed-size blocks of memory, and caches them for a short time when a thread requires them. It is for this reason that scattered reads from many threads earn a very large time penalty.
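A minimal sketch of such a Morton-ordered grid-box hash is shown below; it quantises a boid position into grid-box coordinates and interleaves the coordinate bits. The symbols worldMin (the origin of the simulation box) and cellSize (the grid-box edge length) are assumptions for illustration, not a prescription of the exact kernel code.

// Sketch only: Morton (Z-order) hash for a grid box, assuming worldMin and cellSize.
__device__ unsigned int part1By2(unsigned int v)
{
    v &= 0x000003ff;                  // keep the low 10 bits (up to a 1024^3 grid)
    v = (v ^ (v << 16)) & 0xff0000ff; // spread the bits so that two zero bits
    v = (v ^ (v <<  8)) & 0x0300f00f; // separate each original bit
    v = (v ^ (v <<  4)) & 0x030c30c3;
    v = (v ^ (v <<  2)) & 0x09249249;
    return v;
}

__device__ unsigned int calcGridHashMorton(float4 p, float3 worldMin, float cellSize)
{
    // Quantise the boid position into integer grid-box coordinates.
    unsigned int x = (unsigned int)((p.x - worldMin.x) / cellSize);
    unsigned int y = (unsigned int)((p.y - worldMin.y) / cellSize);
    unsigned int z = (unsigned int)((p.z - worldMin.z) / cellSize);
    // Interleave the coordinate bits: boxes that are close in space
    // receive hash values that are close in memory.
    return (part1By2(z) << 2) | (part1By2(y) << 1) | part1By2(x);
}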
Our article is structured as follows. In Section 2 we present the method by which we construct the grid-boxing datastructure, and also the method by which we use the result to accelerate body-body interactions with multiple GPUs (mGPU) and also a single GPU in a standard simulation of the Boids model as propounded by Reynolds. Following this, in Section 3 we discuss the results we obtain and the approximate scaling we observe from this algorithm, from low numbers of agents (16,384) to very large numbers (1,000,000+). In Section 4 we discuss our results and implementation, and how they compare to other techniques. Section 5 provides conclusions and future work that we may pursue in this area.

2 Method

We use NVidia's CUDA as the parallelisation platform. With this choice comes several restrictions, and perhaps the most prominent of these concerns the memory architecture. A CUDA device's memory is divided into several kinds: constant, global, shared, and texture. Each of these has a different scope and access penalty. Global memory takes by far the longest at about 200 cycles, followed by faster constant memory and very fast shared memory. The scope of these ranges from application level to block level, and also kernel level. A CUDA block is conceptually a 1, 2 or 3D grid of CUDA threads, whose dimensions are arbitrarily defined by the user. For optimal results, it is beneficial to adjust these parameters to suit the program. This is especially important to allow latency hiding when reading global memory (the slowest).

Due to this architecture, it is impractical to use pointers and the simple data structures which require them on the GPU. Efficient practical algorithms for generating trees on GPUs generally make use of hash tables, or space-filling curves.

Algorithm 1 Single-GPU implementation of grid-boxing in boids.
  Allocate and initialise a vec4 array as velocity
  Allocate and initialise a vec4 array as position
  Allocate a uint array as hashes
  Allocate a uint array as indices
  Allocate a uint array as gridBoxStart
  Allocate a uint array as gridBoxEnd
  Copy simulation parameters to device
  copyVectorsToDevice()
  // For n frames.
  for i ← 0 to n OR NOT exit condition do
    i ← i + 1
    // Calculate hashes for all boids.
    for j ← 0 to NUM_BOIDS do
      hashes[j] = calculate_hash(position[j])
    end for
    Sort by hash key (hashes, indices)
    Populate gridBoxStart and gridBoxEnd
    Scatter write boids
    Perform Boid kernel
    copyVectorsFromDevice()
    drawBoids()
    swapDeviceBuffers()
  end for

Algorithm 1 presents the single-GPU version of grid-boxing. All actions in this algorithm are performed in parallel, except for the loop containing the exit condition. This loop is simply used to advance to the next frame.

The very first kernel launched in this algorithm is the hash calculation kernel. This kernel has a simple task: to populate the hashes and indices arrays. Algorithm 2 contains a pseudocode version of this kernel. Once this kernel has completed, the next step is to sort by the hash key. This step is accomplished using Thrust [37]. Thrust makes use of a parallel merge sort for our purposes. In appropriate situations, it uses an aggressively optimised GPU Radix sort written by Merrill and Grimshaw [38].

Algorithm 2 Data-parallel CUDA kernel which calculates hashes for all boids.
  func calc_hash_d(...)
    i = blockIdx.x * blockDim.x + threadIdx.x
    if i < numBoids then
      p = positions[i]
      // Morton ordering for hashes
      hash = calc_grid_hash(p)
      hashes[i] = hash
      indices[i] = i
    end if
  end
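For illustration, the sort-by-key step can be expressed directly against the device arrays of Algorithm 1 using Thrust; the wrapper function and pointer names below are assumptions rather than the exact code of our implementation.

#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sort the boid indices by their grid-box hash so that boids in the same
// grid box become contiguous in memory. d_hashes and d_indices are the raw
// device pointers allocated in Algorithm 1 (names assumed).
void sortBoidsByHash(unsigned int* d_hashes, unsigned int* d_indices, int numBoids)
{
    thrust::device_ptr<unsigned int> keys(d_hashes);
    thrust::device_ptr<unsigned int> vals(d_indices);
    // Keys are sorted in place; vals receives the matching permutation.
    thrust::sort_by_key(keys, keys + numBoids, vals);
}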
Following the sorting phase of the algorithm, a single kernel is executed to populate the grid box starting and ending indices, as well as to perform a scatter write of the boids into their sorted positions, so that the hashes will be effective when the boid kernel is executed. The boid kernel itself simply evaluates the boids within the 8 grid boxes surrounding the grid box that the current boid is in. During this process, the sorting phase affords improved coalesced memory reads for the device, and this greatly speeds up the process. The Morton ordering used when the hashes were calculated also increases the memory locality of the grid boxes surrounding the current boid.

The kernel which calculates the grid box starting and ending indices does so by having each thread keep track of one grid box hash and compare it against the next thread's hash. The threads whose hash does not match that of the next thread are the ones which mark the boundary points of a grid box, and these threads simply write their starting and ending indices into an array in global memory. To accelerate this process, shared memory is used. The threads are then reused to copy the boid indicated by their index into the new sorted position indicated by the index array reordered by the Thrust sorting phase.
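A minimal sketch of such a boundary-detection kernel is given below, assuming the sorted hash array from the previous step. The shared-memory staging and the scatter write of the boid data described above are omitted for brevity, and all names are illustrative rather than taken from our implementation.

// Each thread compares its own (sorted) hash against its neighbours';
// wherever the hashes differ, a grid box starts or ends at this index.
__global__ void findGridBoxBounds(const unsigned int* sortedHashes,
                                  unsigned int* gridBoxStart,
                                  unsigned int* gridBoxEnd,
                                  int numBoids)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBoids) return;

    unsigned int hash = sortedHashes[i];

    // A grid box starts where this thread's hash differs from the previous one.
    if (i == 0 || sortedHashes[i - 1] != hash)
        gridBoxStart[hash] = i;

    // A grid box ends where this thread's hash differs from the next one.
    if (i == numBoids - 1 || sortedHashes[i + 1] != hash)
        gridBoxEnd[hash] = i + 1;   // exclusive end index for this grid box
}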
We modified this algorithm to execute across several GPUs in order both to reduce computation time and to reach even larger numbers of agents. In order to achieve this, we only parallelise one part of the algorithm across more than one GPU: the agent-agent interaction kernel. In larger systems, the computation time required by this kernel dwarfs the datastructure construction time and all other subsequent steps. We therefore find it most suitable to extend the algorithm in this manner first.

The first modification we made was the addition of host POSIX threads. One thread is created for every GPU, and these control the memory and execution of each. A POSIX thread barrier is used to synchronise between the threads. Following this change, we modified the kernel execution code to perform interaction on only part of the total number of agents. For every POSIX thread, the range r of agents to evaluate is r = t/g, where t is the total number of agents and g is the total number of GPUs available. The starting index s for GPU i is simply s = it/g. For example, with t = 1,000,000 agents and g = 4 GPUs, each thread evaluates r = 250,000 agents, starting at s = 250,000i. Once the agent interactions have been evaluated, the POSIX threads are synchronised and the completed parts are copied from each GPU back to the host. This process is effectively a gather operation. Once a frame is drawn or finished, each POSIX thread copies the entire host memory into its GPU, and the process repeats. It is worth noting that one could either have one GPU calculate the initial hashes, sorting and reordering, and then copy (effectively scatter) the results to the other GPUs via host memory, or simply have each GPU perform these steps independently. The latter is a large waste of computation, but copying between GPUs would likely cause a higher overhead. In future, we hope to parallelise these earlier steps as well. A sketch of this host-side decomposition is given after Algorithm 3 below.

Following from Algorithm 1, the modified algorithm for use with more than one GPU is presented in Algorithm 3. This algorithm is a very simple parallelisation of Algorithm 1, but it affords an excellent (albeit single-factor) increase in performance, and it also brings into reach even higher numbers of agents with reasonable computing time.

Algorithm 3 Multiple-GPU (mGPU) implementation of grid-boxing in boids.
  Allocate arrays
  Copy simulation parameters to constant memory on all devices
  HOST.copyVectorsToDevices()
  for i ← 0 to n OR NOT exit condition do
    i ← i + 1
    for each gpu g in parallel do
      for j ← 0 to NUM_BOIDS in parallel do
        hashes[j] = calculate_hash(position[j])
      end for
      Sort by hash key (hashes, indices)
      Populate gridBoxStart and gridBoxEnd
      Scatter write boids
      Perform Boid kernel
    end for
    copyVectorsPartiallyFromDevices()
    drawBoids()
    copyVectorsToDevices()
  end for

Firstly, all vectors are copied onto all devices (a scatter operation). Then, for n frames, or until the application quits, each GPU will calculate hashes for each boid, sort by hash, populate the grid box starting and ending indices, scatter the boids into their sorted positions, and finally perform a partial agent-agent interaction kernel. Once this is complete, the results are gathered back onto the host, where they are either drawn on the screen and/or scattered back to the GPUs, and the process repeats.

The simulation model that we use for benchmarking these techniques is the Boids model, originally by Craig Reynolds [6, 7]. In our simulation, we make use of a goal rule to direct boids toward the centre of a fixed-size box. Figures 4(a) and 4(b) show a visual indication of boid clustering near the start and end of a simulation, respectively. This shows more clearly that during simulation, boids cluster more tightly in the middle of the box. We ensure that this is the case so that we can determine exactly how the effectiveness of grid-boxing is lost as boid clustering increases and grid boxes contain more and more boids.
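To make the host-side decomposition of the mGPU version concrete, the sketch below shows one way a per-GPU POSIX worker thread could partition the interaction step according to r = t/g and s = it/g. It is an illustrative reconstruction under assumed names (GpuTask, boidInteractionKernel, the omitted copy steps), not our actual code.

#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical interaction kernel: evaluates boids in [start, start + count).
__global__ void boidInteractionKernel(int start, int count)
{
    int j = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (j < start + count) {
        // ... evaluate boid j against the boids in its surrounding grid boxes ...
    }
}

// One of these is passed to each host POSIX thread (one thread per GPU).
struct GpuTask {
    int gpuId;                   // i: which GPU this thread drives
    int numGpus;                 // g: total number of GPUs
    int numBoids;                // t: total number of agents
    pthread_barrier_t* barrier;  // shared barrier for synchronisation
};

void* gpuWorker(void* arg)
{
    GpuTask* task = (GpuTask*)arg;
    cudaSetDevice(task->gpuId);

    int r = task->numBoids / task->numGpus;   // agents evaluated on this GPU
    int s = task->gpuId * r;                  // starting index for this GPU

    // ... copy the host vectors to this device (scatter) here ...

    boidInteractionKernel<<<(r + 255) / 256, 256>>>(s, r);
    cudaDeviceSynchronize();

    // ... copy the completed slice back to the host (gather) here ...

    pthread_barrier_wait(task->barrier);      // all GPUs finish before drawing
    return 0;
}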

Figure 4: Histogram plots of spatial locations of boids during the simulation. (a) A histogram of the spatial locations of boids near the beginning of the simulation. (b) Another histogram of spatial locations of boids several hundred frames later. Red refers to a cell with 1 boid, green 2 boids, blue 3 boids, and yellow 4 boids.

Figure 5: A y-normalised plot of the agent-agent interaction CUDA kernel.

Figure 6: A linear plot of the construction time needed for the datastructure.

3 Results

We tested the performance of our mGPU and single-GPU versions of the grid-boxing algorithm. The results for mGPU interaction and datastructure construction times are presented in Figures 5 and 6, in the form of a y-normalised plot for the interaction computation and a linear plot for the datastructure construction time. In both of these plots, the data points are 10-frame averages. For the mGPU implementation, we average the compute time across the four GPUs that we use. The testing GPUs are two NVidia GTX 590s, each of which carries two physical GPUs.

Our simulation is simply a flock of boids moving toward the centre of the box they are in. Movement towards the centre causes a clustering effect in the centre of the box, which compresses and pulsates a number of times before reaching what can be described as equilibrium. This accounts for the initial increase in computation time, followed by slight harmonic fluctuations.

Figure 7 shows how the mGPU implementation scales against the single-GPU version of the algorithm. It shows an impressive performance gain over the single-GPU version, particularly as the system becomes larger. Smaller systems are simply more suited to being evaluated on one GPU, to avoid the large overhead of memory copies across the PCI-E bus.

The uniform nature of the grid-boxing datastructure makes mGPU an attainable goal for multi-agent simulations, as it allows data to be distributed with minimal copying between GPUs while still preserving accuracy. This technique has already seen considerable use.

Finally, it is worth noting that grid-boxing is only well-suited to algorithms which have a fixed interaction distance. The smaller this distance is, the better the algorithm will perform, given that the grid is appropriately sized. For the results to remain fully accurate and representative of the original simulation code, the interaction distance must be smaller than the grid box size.
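The fixed interaction distance is what makes the surrounding-box traversal in the boid kernel sufficient: as long as a grid box edge is at least as long as the interaction radius, every potential neighbour lies in the current box or one of the adjacent boxes. A minimal sketch of that traversal is shown below (2D indexing and a simple row-major hash for brevity, with assumed names and without the Boids steering rules), rather than our exact kernel.

// Sketch: examine only the current grid box and its 8 surrounding boxes.
__device__ void forEachNeighbour(int2 cell, int2 gridSize,
                                 const unsigned int* gridBoxStart,
                                 const unsigned int* gridBoxEnd,
                                 const float4* sortedPositions,
                                 float4 myPos, float interactionRadius)
{
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int2 nb = make_int2(cell.x + dx, cell.y + dy);
            if (nb.x < 0 || nb.y < 0 || nb.x >= gridSize.x || nb.y >= gridSize.y)
                continue;                                    // skip boxes outside the grid
            unsigned int boxHash = nb.y * gridSize.x + nb.x; // row-major hash for brevity
            for (unsigned int k = gridBoxStart[boxHash]; k < gridBoxEnd[boxHash]; ++k) {
                float4 other = sortedPositions[k];
                float ddx = other.x - myPos.x;
                float ddy = other.y - myPos.y;
                if (ddx * ddx + ddy * ddy < interactionRadius * interactionRadius) {
                    // ... accumulate separation/alignment/cohesion terms here ...
                }
            }
        }
    }
}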

Figure 7: A linear plot of the mGPU and single-GPU interaction compute time vs the system size.

4 Discussion

Uniform grid-boxing might well be a more simplistic space partitioning method, but it is readily parallelisable, as opposed to other datastructures inspired by single-threaded solutions. Those solutions often rely on pointers for the simplest implementation, as well as recursion. These two techniques are the bane of data-parallel programs.

The CUDA-GPU memory architecture does not allow for a heap, and branch divergence is an even larger problem. Branching is allowed in CUDA, but it comes at a performance loss. Each time threads in a warp diverge from one another, some threads simply execute NOOPs until they converge again. Thus, a CUDA kernel is most effective when its threads execute the same instructions.

Other space partitioning algorithms also exist which perform well, but they only slightly resemble the algorithms which inspired them. For example, it may well be possible to replace a K-d tree with a hashtable to make it suitable for execution on GPU, but such a multi-threaded hashtable requires mutexes, another source of branch divergence. It seems the only way to perform space partitioning on the GPU in a time short enough to justify the effort of writing a CUDA-based program is to utilise some kind of sorting algorithm to place particles in very deliberate places in memory, gaining both spatial locality and a structured method of traversing the data.

Sorting on GPU has been studied extensively, and many excellent solutions already exist, most notably the Radix sort by Merrill and Grimshaw merged into the Thrust library, which is shipped with CUDA [38].

5 Conclusions

We have presented a multiple-GPU (mGPU) and single-GPU solution for grid-boxing in multi-spatial-agent simulations. Performance measurements were made and the two algorithms compared, and, as expected, an overwhelmingly clear advantage of mGPU over the single-GPU implementation was seen. Even though memory sacrifices were made in the process, larger systems continued to receive an excellent increase in performance. These techniques are vital for facilitating large system sizes of agents and the investigation of logarithmically scaled emergent phenomena in simulations.

In future we hope to parallelise the hash calculation and the sorting and reordering phases also. These phases are more difficult to parallelise over multiple GPUs, but doing so will greatly improve performance and allow us to reach much larger systems as well. These approaches will allow interactive-time visualisation of large scale spatial agent simulations and hence make it easier to search the parameter space of the models.

References

[1] Couchman, H.M.P.: Cosmological simulations using adaptive particle-mesh methods. Web (1994)
[2] Couchman, H.M.P.: Gravitational n-body simulation of large-scale cosmic structure. Number 13. In: The Restless Universe: Applications of Gravitational N-Body Dynamics to Planetary, Stellar and Galactic Systems. Taylor and Francis (2001) 239–264 ISBN: 978-0-7503-0822-9.
[3] Couchman, H.M.P.: Simulating the formation of large-scale cosmic structure with particle-grid methods. J. Comp. and Applied Mathematics 109 (1999) 373–406
[4] Katzenelson, J.: Computational structure of the n-body problem. Technical Report AI Memo 1042, MIT AI Lab (1988)
[5] Liewer, P.C., Decyk, V., Dawson, J., Fox, G.C.: A universal concurrent algorithm for plasma particle-in-cell simulation codes. In: Proc. Third Hypercube Conference. Number C3P-362 (1988) 1101–1107
[6] Reynolds, C.: Flocks, herds and schools: A distributed behavioral model. In: SIGGRAPH '87: Proc. 14th Annual Conf. on Computer Graphics and Interactive Techniques, ACM (1987) 25–34 ISBN 0-89791-227-6.
[7] Reynolds, C.: Boids background and update. http://www.red3d.com/cwr/boids/ (2011)
[8] Hawick, K.A., James, H.A., Scogings, C.J.: Grid-boxing for spatial simulation performance optimisation. In T. Znati, ed.: Proc. 39th Annual Simulation Symposium, Huntsville, Alabama, USA (2006) The Society for Modeling and Simulation International, Pub. IEEE Computer Society.
[9] Guo, D.: Spatial cluster ordering and encoding for high-dimensional geographic knowledge discovery. Technical report, GeoVISTA Center and Department of Geography, Pennsylvania State University (2002)
[10] Guo, D., Gehagen, M.: Spatial ordering and encoding for geographic data mining and visualization. J. Intell. Inf. Sys. 27 (2006) 243–266
[11] Appel, A.W.: An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput. 6 (1985) 85–103
[12] Singer, J.K.: Parallel implementation of the fast multipole method with periodic boundary conditions. East-West J. Numer. Maths 3 (1995) 199–216
[13] Gelato, S., Chernoff, D.F., Wasserman, I.: An adaptive hierarchical particle-mesh code with isolated boundary conditions. The Astrophysical Journal 480 (1997) 115–131
[14] Barnes, J., Hut, P.: A hierarchical O(N log N) force-calculation algorithm. Nature 324 (1986) 446–449
[15] Burton, A., Field, A.J., To, H.W.: A cell-cell Barnes-Hut algorithm for fast particle simulation. Australian Computer Science Communications 20 (1998) 267–278
[16] Darve, E., Cecka, C., Takahashi, T.: The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Compte Rendus Mecanique 339 (2011) 185–193
[17] Warren, M.S., Salmon, J.K.: Astrophysical n-body simulations using hierarchical tree data structures. In: Proc. ACM IEEE Conf. on Supercomputing (SC92). (1992)
[18] Samet, H.: Data structures for quadtree approximation and compression. Communications of the ACM 28 (1985) 973–993
[19] O'Connell, J., Sullivan, F., Libes, D., Orlandini, E., Tesi, M.C., Stella, A.L., Einstein, T.L.: Self-avoiding random surfaces: Monte Carlo study using oct-tree data-structure. J. Phys. A: Math. and Gen. 24 (1991) 4619–4631
[20] Warren, M.S., Salmon, J.K.: A parallel hashed oct-tree n-body algorithm. In: Supercomputing. (1993) 12–21
[21] Libes, D.: Modeling dynamic surfaces with octrees. Computers & Graphics 15 (1991) 383–387
[22] Bedorf, J., Gaburov, E., Zwart, S.P.: A sparse octree gravitational n-body code that runs entirely on the GPU processor. J. Comp. Phys. Online December (2011) 1–34
[23] Dubinski, J.: A parallel tree code. Interface 1 (1996) 1–19
[24] Gope, D., Jandhyala, V.: Oct-tree-based multilevel low-rank decomposition algorithm for rapid 3-D parasitic extraction. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 23 (2004) 1575–1580
[25] Jackins, C.L., Tanimoto, S.L.: Oct-trees and their use in representing three-dimensional objects. Computer Graphics and Image Processing 14 (1980) 249–270
[26] Ricker, P.M.: A direct multigrid Poisson solver for oct-tree adaptive meshes. The Astrophysical Journal Supplement Series 176 (2008) 293–300
[27] Danilewski, P., Popov, S., Slusallek, P.: Binned SAH kd-tree construction on a GPU. Technical report, Saarland University, Germany (2010) 15 Pages.
[28] Yang, H.R., Kang, K.K., Kim, D.: Acceleration of massive particle data visualization based on GPU. In: Virtual and Mixed Reality, Part II, HCII 2011. Number 6774 in LNCS, Springer (2011) 425–431
[29] Zhang, Y., Liu, L., Lu, X., Li, D.: Efficient range query processing over DHTs based on the balanced Kautz tree. Concurrency and Computation: Practice and Experience 23 (2011) 796–814
[30] Kravtsov, A.V., Klypin, A.A., Khokhlov, A.M.: Adaptive refinement tree: A new high-resolution n-body code for cosmological simulations. The Astrophysical Journal Supplement Series 111 (1997) 73–94
[31] Green, S.: Particle simulation using CUDA. NVIDIA (2010)
[32] Leist, A., Playne, D., Hawick, K.: Exploiting Graphical Processing Units for Data-Parallel Scientific Applications. Concurrency and Computation: Practice and Experience 21 (2009) 2400–2437 CSTN-065.
[33] Akarsu, E., Dincer, K., Haupt, T., Fox, G.C.: Particle-in-cell simulation codes in High Performance Fortran. In: Proc. ACM/IEEE Conf. on Supercomputing (SC'96), Washington DC, USA (1996)
[34] Nakasato, N.: Implementation of a parallel tree method on a GPU. J. Comp. Science Online March (2011) 1–10
[35] Schive, H.Y., Zhang, U.H., Chiueh, T.: Directionally unsplit hydrodynamic schemes with hybrid MPI/OpenMP/GPU parallelization in AMR. Int. J. High Perf. Computing Applications Online November (2011) 1–16
[36] et al., M.: GPU octrees and optimized search. VIII Brazilian Symposium on Games and Digital Entertainment (2009) 73–76
[37] Hoberock, J., Bell, N.: Thrust: A parallel template library. http://www.meganewtons.com/ (2011)
[38] Merrill, D., Grimshaw, A.: Revisiting sorting for GPGPU stream architectures. In: PACT '10: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. (2010)