Spatial Data Structures, Sorting and GPU Parallelism for Situated-agent Simulation and Visualisation

A.V. Husselmann and K.A. Hawick
Computer Science, Institute for Information and Mathematical Sciences,
Massey University, North Shore 102-904, Auckland, New Zealand
email: { a.v.husselmann, k.a.hawick }@massey.ac.nz
Tel: +64 9 414 0800  Fax: +64 9 441 8181
March 2012

ABSTRACT

Spatial data partitioning techniques are important for obtaining fast and efficient simulations of N-Body particle and spatial agent based models, where they considerably reduce redundant entity interaction computation times. Highly parallel techniques based on concurrent threading can be deployed to further speed up such simulations. We study the use of GPU accelerators and highly data parallel techniques, which require more complex organisation of spatial datastructures and also sorting techniques to make best use of GPU capabilities. We report on a multiple-GPU (mGPU) solution to grid-boxing for accelerating interaction-based models. Our system is able to both simulate and also graphically render in excess of 10^5 to 10^6 agents on desktop hardware in interactive time.

Figure 1: A visualisation of a uniform grid datastructure as generated by GPU, using NVidia's CUDA.

KEY WORDS
grid-boxing; sorting; GPU; thread concurrency; data parallelism

1 Introduction

Simulation problems involving interacting spatially located entities, such as N-Body particle models [1-4] or spatial agent-based models, often require considerable computational power to support simulation of adequately large scale systems. Large size is important to expose and investigate complex and emergent phenomena that often only appear on logarithmic length and system size scales. To reduce the O(N^2) computational complexity of such interacting systems, spatial partitioning methods are often used. Depending upon how effective the scheme used is, the computational complexity can be reduced down to O(N log N). Furthermore, a good scheme will also support, and not impede, the introduction of parallelism into the computation so that an appropriate parallel computer architecture [5] can be used to further reduce compute time per simulation step. As a vehicle, we use the Boids model originally by Reynolds [6, 7].

Spatial partitioning [8-10] has been employed as far back as the earliest attempts of Andrew Appel [11], who produced the first multipole partitioning method, with many subsequent developments [12]. The N-Body particle simulation has served as a simple but effective testing platform for improving the performance of particle simulations and, more generally, interaction-based models which rely on particle-particle interactions [13]. The vast majority of particle simulations suffer from an inherent performance problem and a severe lack of system size scaling. This is mostly due to every particle interacting with every other particle, inducing an O(N^2) complexity. Even for a conceptually simple interaction such as a nearest neighbour (NN) interaction, it is still necessary, in the simplest cases, to perform calculations (such as Euclidean distance) for every combination of two particles. Collision detection is one such problem.

Figure 2: Visualisation of the same set of agents with and without a uniform grid. (a) Screenshot of our Boids simulation. (b) Screenshot of the same simulation state with visualisation of the grid-box algorithm superposed.

Figure 3: Examples of datastructures used in other implementations for the same purpose. (a) A visual example of a K-d tree datastructure used extensively in CPU implementations for spatial partitioning purposes. (b) K-d tree datastructure, with particles clustered so densely that the datastructure loses its effectiveness.

This problem is mitigated by using methods such as Barnes-Hut treecodes [14, 15], Fast Multipole methods [16], and tree organisational methods [17, 18] including Octrees [19-26], K-d trees [27, 28], Kautz trees [29], Adaptive Refinement Trees [30], and more, all of which fall under the category of spatial partitioning. Not all of these algorithms would necessarily be suitable for any one particle simulation. Some are suited to simulations which frequently have high density in one or several spatial locations, some perform best with uniformly distributed particles, and some attempt to systematically treat several particles as a single point mass to reduce computation in exchange for losing a small amount of precision. This last technique is often well-suited to applications in astrophysics. There also exist algorithms which sacrifice precision for performance, with techniques as simple as updating only a certain number of agents per simulation time step; this dramatically increases performance, but the loss of precision is so great that the result no longer reflects the original simulation code. Choosing which of these algorithms or methods to use in a simulation therefore ultimately depends on the purpose of the simulation.

This work contributes towards efficiency in the simulation of large multi-agent systems which do not require high precision in very small clusters. To accomplish such a task in a single-threaded environment, one would normally employ standard techniques such as Octrees or K-d trees, which are relatively simple to implement in single-threaded code. K-d trees offer some parallelisation opportunities, with some authors reporting piece-wise GPU datastructure construction algorithms of O(N log^2 N) and also O(N log N) complexity [27, 28]. The former is normally the case when constructing a K-d tree by performing an O(N log N) sort for every node in the tree to obtain the median. The latter is the case when using a linear median-finding algorithm, which is neither trivial to implement on GPU nor well-suited to it.

One space partitioning algorithm responds very well to parallelism: grid-boxing [8] (sometimes known as uniform grid space partitioning). The NVIDIA CUDA SDK comes with a particle simulator which makes use of this technique [31]. Our single-GPU implementation of grid-boxing is heavily based on this.

Graphical Processing Units (GPUs) [32] lend themselves particularly well to accelerating simulations, especially since the advent of NVidia's CUDA. CUDA is a powerful addition to C++ which allows the programmer to write code in C or C++ containing sections of special syntax, which NVidia's compiler in the CUDA SDK compiles into GPU instructions before handing the rest of the program code to the system's built-in C or C++ compiler.

Figures 3(a) and 3(b) show examples of K-d trees, which are very popular for accelerating programs [27, 28]. Since many aspects of ray tracing are inherently parallel, K-d trees have also received a large amount of research effort in order to offload as much computation as possible from the CPU to the GPU. For this reason, GPU K-d trees are of great interest, and numerous implementations already exist [33-35]. Each of these implementations has a different ratio of CPU code to GPU code. Significant parallelism can be achieved by building the tree level-by-level instead of node-by-node, but even with this advantage, it is difficult to retain most computation on the GPU.

Morton ordering (also known as Z-ordering or N-ordering) is a very popular space-filling curve for the purpose of ensuring spatial locality [36]. Treecodes on GPU make use of Morton codes, which serve the purpose of encoding tree nodes for storing in hashtables [20, 34]. We use this method to increase coalesced global memory reads. The CUDA architecture generally reads fixed-size blocks of memory, and caches them for a short time when a thread requires them. It is for this reason that scattered reads from many threads earn a very large time penalty.
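A minimal sketch of such a Morton-ordered grid-box hash is shown below; it quantises a boid position into grid-box coordinates and interleaves the coordinate bits. The symbols worldMin (the origin of the simulation box) and cellSize (the grid-box edge length) are assumptions for illustration, not a prescription of the exact kernel code.

// Sketch only: Morton (Z-order) hash for a grid box, assuming worldMin and cellSize.
__device__ unsigned int part1By2(unsigned int v)
{
    v &= 0x000003ff;                  // keep the low 10 bits (up to a 1024^3 grid)
    v = (v ^ (v << 16)) & 0xff0000ff; // spread the bits so that two zero bits
    v = (v ^ (v <<  8)) & 0x0300f00f; // separate each original bit
    v = (v ^ (v <<  4)) & 0x030c30c3;
    v = (v ^ (v <<  2)) & 0x09249249;
    return v;
}

__device__ unsigned int calcGridHashMorton(float4 p, float3 worldMin, float cellSize)
{
    // Quantise the boid position into integer grid-box coordinates.
    unsigned int x = (unsigned int)((p.x - worldMin.x) / cellSize);
    unsigned int y = (unsigned int)((p.y - worldMin.y) / cellSize);
    unsigned int z = (unsigned int)((p.z - worldMin.z) / cellSize);
    // Interleave the coordinate bits: boxes that are close in space
    // receive hash values that are close in memory.
    return (part1By2(z) << 2) | (part1By2(y) << 1) | part1By2(x);
}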
Our article is structured as follows. In Section 2 we present the method by which we construct the grid-boxing datastructure, and also the method by which we use the result to accelerate body-body interactions with multiple GPUs (mGPU) and also a single GPU in a standard simulation of the Boids model as propounded by Reynolds. Following this, in Section 3 we discuss the results we obtain and the approximate scaling we observe from this algorithm, from low numbers of agents (16,384) to very large numbers (1,000,000+). In Section 4 we discuss our results and implementation, and how they compare to other techniques. Section 5 provides conclusions and future work that we may pursue in this area.

2 Method

We use NVidia's CUDA as the parallelisation platform. With this choice comes several restrictions, and perhaps the most prominent of these concerns the memory architecture. A CUDA device's memory is divided into several kinds: constant, global, shared, and texture. Each of these has a different scope and access penalty. Global memory takes by far the longest at about 200 cycles, followed by faster constant memory and very fast shared memory. The scope of these ranges from application level to block level, and also kernel level. A CUDA block is conceptually a 1, 2 or 3D grid of CUDA threads, whose dimensions are arbitrarily defined by the user. For optimal results, it is beneficial to adjust these parameters to suit the program. This is especially important to allow latency hiding when reading global memory (the slowest).

Due to this architecture, it is impractical to use pointers and the simple data structures which require them on the GPU. Efficient practical algorithms for generating trees on GPUs generally make use of hash tables, or space-filling curves.

Algorithm 1 Single-GPU implementation of grid-boxing in boids.
  Allocate and initialise a vec4 array as velocity
  Allocate and initialise a vec4 array as position
  Allocate a uint array as hashes
  Allocate a uint array as indices
  Allocate a uint array as gridBoxStart
  Allocate a uint array as gridBoxEnd
  Copy simulation parameters to device
  copyVectorsToDevice()
  // For n frames.
  for i ← 0 to n OR NOT exit condition do
    i ← i + 1
    // Calculate hashes for all boids.
    for j ← 0 to NUM_BOIDS do
      hashes[j] = calculate_hash(position[j])
    end for
    Sort by hash key (hashes, indices)
    Populate gridBoxStart and gridBoxEnd
    Scatter write boids
    Perform Boid kernel
    copyVectorsFromDevice()
    drawBoids()
    swapDeviceBuffers()
  end for

Algorithm 1 presents the single-GPU version of grid-boxing. All actions in this algorithm are performed in parallel, except for the loop containing the exit condition. This loop is simply used to advance to the next frame.

The very first kernel launched in this algorithm is the hash calculation kernel. This kernel has a simple task: to populate the hashes and indices arrays. Algorithm 2 contains a pseudocode version of this kernel. Once this kernel has completed, the next step is to sort by the hash key. This step is accomplished using Thrust [37]. Thrust makes use of a parallel merge sort for our purposes. In appropriate situations, it uses an aggressively optimised GPU Radix sort written by Merrill and Grimshaw [38].

Algorithm 2 Data-parallel CUDA kernel which calculates hashes for all boids.
  func calc_hash_d(...)
    i = blockIdx.x * blockDim.x + threadIdx.x
    if i < numBoids then
      p = positions[i]
      // Morton ordering for hashes
      hash = calc_grid_hash(p)
      hashes[i] = hash
      indices[i] = i
    end if
  end
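For illustration, the sort-by-key step can be expressed directly against the device arrays of Algorithm 1 using Thrust; the wrapper function and pointer names below are assumptions rather than the exact code of our implementation.

#include <thrust/device_ptr.h>
#include <thrust/sort.h>

// Sort the boid indices by their grid-box hash so that boids in the same
// grid box become contiguous in memory. d_hashes and d_indices are the raw
// device pointers allocated in Algorithm 1 (names assumed).
void sortBoidsByHash(unsigned int* d_hashes, unsigned int* d_indices, int numBoids)
{
    thrust::device_ptr<unsigned int> keys(d_hashes);
    thrust::device_ptr<unsigned int> vals(d_indices);
    // Keys are sorted in place; vals receives the matching permutation.
    thrust::sort_by_key(keys, keys + numBoids, vals);
}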
Following the sorting phase of the algorithm, a single kernel is executed to populate the grid box starting and ending indices, as well as to perform a scatter write of the boids into their sorted positions, so that the hashes will be effective when the boid kernel is executed. The boid kernel itself simply evaluates the boids within the 8 grid boxes surrounding the grid box that the current boid is in. During this process, the sorting phase affords improved coalesced memory reads for the device, and this greatly speeds up the process. The Morton ordering used when the hashes were calculated also increases the memory locality of the grid boxes surrounding the current boid.

The kernel which calculates the grid box starting and ending indices does so by having each thread keep track of one grid box hash and compare it against the next thread's hash. The threads whose hash does not match that of the next thread are the ones which mark the boundary points of a grid box, and these threads simply write their starting and ending indices into an array in global memory. To accelerate this process, shared memory is used. The threads are then reused to copy the boid indicated by their index into the new sorted position indicated by the index array reordered by the Thrust sorting phase.
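A minimal sketch of such a boundary-detection kernel is given below, assuming the sorted hash array from the previous step. The shared-memory staging and the scatter write of the boid data described above are omitted for brevity, and all names are illustrative rather than taken from our implementation.

// Each thread compares its own (sorted) hash against its neighbours';
// wherever the hashes differ, a grid box starts or ends at this index.
__global__ void findGridBoxBounds(const unsigned int* sortedHashes,
                                  unsigned int* gridBoxStart,
                                  unsigned int* gridBoxEnd,
                                  int numBoids)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= numBoids) return;

    unsigned int hash = sortedHashes[i];

    // A grid box starts where this thread's hash differs from the previous one.
    if (i == 0 || sortedHashes[i - 1] != hash)
        gridBoxStart[hash] = i;

    // A grid box ends where this thread's hash differs from the next one.
    if (i == numBoids - 1 || sortedHashes[i + 1] != hash)
        gridBoxEnd[hash] = i + 1;   // exclusive end index for this grid box
}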
We modified this algorithm to execute across several GPUs in order both to reduce computation time and to reach even larger numbers of agents. In order to achieve this, we only parallelise one part of the algorithm across more than one GPU: the agent-agent interaction kernel. In larger systems, the computation time required by this kernel dwarfs the datastructure construction time and all other subsequent steps. We therefore find it most suitable to extend the algorithm in this manner first.

The first modification we made was the addition of host POSIX threads. One thread is created for every GPU, and these control the memory and execution of each. A POSIX thread barrier is used to synchronise between the threads. Following this change, we modified the kernel execution code to perform interaction on only part of the total number of agents. For every POSIX thread, the range r of agents to evaluate is r = t/g, where t is the total number of agents and g is the total number of GPUs available. The starting index s for GPU i is simply s = it/g. For example, with t = 1,000,000 agents and g = 4 GPUs, each thread evaluates r = 250,000 agents, starting at s = 250,000i. Once the agent interactions have been evaluated, the POSIX threads are synchronised and the completed parts are copied from each GPU back to the host. This process is effectively a gather operation. Once a frame is drawn or finished, each POSIX thread copies the entire host memory into its GPU, and the process repeats. It is worth noting that one could either have one GPU calculate the initial hashes, sorting and reordering, and then copy (effectively scatter) the results to the other GPUs via host memory, or simply have each GPU perform these steps independently. The latter is a large waste of computation, but copying between GPUs would likely cause a higher overhead. In future, we hope to parallelise these earlier steps as well. A sketch of this host-side decomposition is given after Algorithm 3 below.

Following from Algorithm 1, the modified algorithm for use with more than one GPU is presented in Algorithm 3. This algorithm is a very simple parallelisation of Algorithm 1, but it affords an excellent (albeit single-factor) increase in performance, and it also brings into reach even higher numbers of agents with reasonable computing time.

Algorithm 3 Multiple-GPU (mGPU) implementation of grid-boxing in boids.
  Allocate arrays
  Copy simulation parameters to constant memory on all devices
  HOST.copyVectorsToDevices()
  for i ← 0 to n OR NOT exit condition do
    i ← i + 1
    for each gpu g in parallel do
      for j ← 0 to NUM_BOIDS in parallel do
        hashes[j] = calculate_hash(position[j])
      end for
      Sort by hash key (hashes, indices)
      Populate gridBoxStart and gridBoxEnd
      Scatter write boids
      Perform Boid kernel
    end for
    copyVectorsPartiallyFromDevices()
    drawBoids()
    copyVectorsToDevices()
  end for

Firstly, all vectors are copied onto all devices (a scatter operation). Then, for n frames, or until the application quits, each GPU will calculate hashes for each boid, sort by hash, populate the grid box starting and ending indices, scatter the boids into their sorted positions, and finally perform a partial agent-agent interaction kernel. Once this is complete, the results are gathered back onto the host, where they are either drawn on the screen and/or scattered back to the GPUs, and the process repeats.

The simulation model that we use for benchmarking these techniques is the Boids model, originally by Craig Reynolds [6, 7]. In our simulation, we make use of a goal rule to direct boids toward the centre of a fixed-size box. Figures 4(a) and 4(b) show a visual indication of boid clustering near the start and end of a simulation, respectively. This shows more clearly that during simulation, boids cluster more tightly in the middle of the box. We ensure that this is the case so that we can determine exactly how the effectiveness of grid-boxing is lost as boid clustering increases and grid boxes contain more and more boids.
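To make the host-side decomposition of the mGPU version concrete, the sketch below shows one way a per-GPU POSIX worker thread could partition the interaction step according to r = t/g and s = it/g. It is an illustrative reconstruction under assumed names (GpuTask, boidInteractionKernel, the omitted copy steps), not our actual code.

#include <pthread.h>
#include <cuda_runtime.h>

// Hypothetical interaction kernel: evaluates boids in [start, start + count).
__global__ void boidInteractionKernel(int start, int count)
{
    int j = start + blockIdx.x * blockDim.x + threadIdx.x;
    if (j < start + count) {
        // ... evaluate boid j against the boids in its surrounding grid boxes ...
    }
}

// One of these is passed to each host POSIX thread (one thread per GPU).
struct GpuTask {
    int gpuId;                   // i: which GPU this thread drives
    int numGpus;                 // g: total number of GPUs
    int numBoids;                // t: total number of agents
    pthread_barrier_t* barrier;  // shared barrier for synchronisation
};

void* gpuWorker(void* arg)
{
    GpuTask* task = (GpuTask*)arg;
    cudaSetDevice(task->gpuId);

    int r = task->numBoids / task->numGpus;   // agents evaluated on this GPU
    int s = task->gpuId * r;                  // starting index for this GPU

    // ... copy the host vectors to this device (scatter) here ...

    boidInteractionKernel<<<(r + 255) / 256, 256>>>(s, r);
    cudaDeviceSynchronize();

    // ... copy the completed slice back to the host (gather) here ...

    pthread_barrier_wait(task->barrier);      // all GPUs finish before drawing
    return 0;
}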

Figure 4: Histogram plots of spatial locations of boids during the simulation. (a) A histogram of the spatial locations of boids near the beginning of the simulation. (b) Another histogram of spatial locations of boids several hundred frames later. Red refers to a cell with 1 boid, green 2 boids, blue 3 boids, and yellow 4 boids.

Figure 5: A y-normalised plot of the agent-agent interaction CUDA kernel.

Figure 6: A linear plot of the construction time needed for the datastructure.

3 Results

We tested the performance of our mGPU and single-GPU versions of the grid-boxing algorithm. The results for mGPU interaction and datastructure construction times are presented in Figures 5 and 6, in the form of a y-normalised plot for the interaction computation and a linear plot for the datastructure construction time. In both of these plots, the data points are 10-frame averages. For the mGPU implementation, we average the compute time across the four GPUs that we use. The testing GPUs are two NVidia GTX 590s, each of which carries two physical GPUs.

Our simulation is simply a flock of boids moving toward the centre of the box they are in. Movement towards the centre causes a clustering effect in the centre of the box, which compresses and pulsates a number of times before reaching what can be described as equilibrium. This accounts for the initial increase in computation time, followed by slight harmonic fluctuations.

Figure 7 shows how the mGPU implementation scales against the single-GPU version of the algorithm. It shows an impressive performance gain over the single-GPU version, particularly as the system becomes larger. Smaller systems are simply more suited to being evaluated on one GPU, to avoid the large overhead of memory copies across the PCI-E bus.

The uniform nature of the grid-boxing datastructure makes mGPU an attainable goal for multi-agent simulations, as it allows data to be distributed with minimal copying between GPUs while still preserving accuracy. This technique has already seen considerable use.

Finally, it is worth noting that grid-boxing is only well-suited to algorithms which have a fixed interaction distance. The smaller this distance is, the better the algorithm will perform, given that the grid is appropriately sized. For the results to remain fully accurate and representative of the original simulation code, the interaction distance must be smaller than the grid box size.
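The fixed interaction distance is what makes the surrounding-box traversal in the boid kernel sufficient: as long as a grid box edge is at least as long as the interaction radius, every potential neighbour lies in the current box or one of the adjacent boxes. A minimal sketch of that traversal is shown below (2D indexing and a simple row-major hash for brevity, with assumed names and without the Boids steering rules), rather than our exact kernel.

// Sketch: examine only the current grid box and its 8 surrounding boxes.
__device__ void forEachNeighbour(int2 cell, int2 gridSize,
                                 const unsigned int* gridBoxStart,
                                 const unsigned int* gridBoxEnd,
                                 const float4* sortedPositions,
                                 float4 myPos, float interactionRadius)
{
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int2 nb = make_int2(cell.x + dx, cell.y + dy);
            if (nb.x < 0 || nb.y < 0 || nb.x >= gridSize.x || nb.y >= gridSize.y)
                continue;                                    // skip boxes outside the grid
            unsigned int boxHash = nb.y * gridSize.x + nb.x; // row-major hash for brevity
            for (unsigned int k = gridBoxStart[boxHash]; k < gridBoxEnd[boxHash]; ++k) {
                float4 other = sortedPositions[k];
                float ddx = other.x - myPos.x;
                float ddy = other.y - myPos.y;
                if (ddx * ddx + ddy * ddy < interactionRadius * interactionRadius) {
                    // ... accumulate separation/alignment/cohesion terms here ...
                }
            }
        }
    }
}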

Figure 7: A linear plot of the mGPU and single-GPU interaction compute time vs the system size.

4 Discussion

Uniform grid-boxing might well be a more simplistic space partitioning method, but it is readily parallelisable, as opposed to other datastructures inspired by single-threaded solutions. Those solutions often rely on pointers for the simplest implementation, as well as recursion. These two techniques are the bane of data-parallel programs.

The CUDA-GPU memory architecture does not allow for a heap, and branch divergence is an even larger problem. Branching is allowed in CUDA, but it comes at a performance loss. Each time threads in a warp diverge from one another, some threads simply execute NOOPs until they converge again. Thus, a CUDA kernel is most effective when its threads execute the same instructions.

Other space partitioning algorithms also exist which perform well, but they only slightly resemble the algorithms which inspired them. For example, it may well be possible to replace a K-d tree with a hashtable to make it suitable for execution on GPU, but such a multi-threaded hashtable requires mutexes, another source of branch divergence. It seems the only way to perform space partitioning on the GPU in a time short enough to justify the effort of writing a CUDA-based program is to utilise some kind of sorting algorithm to place particles in very deliberate places in memory, gaining both spatial locality and a structured method of traversing the data.

Sorting on GPU has been studied extensively, and many excellent solutions already exist, most notably the Radix sort by Merrill and Grimshaw merged into the Thrust library, which is shipped with CUDA [38].

5 Conclusions

We have presented a multiple-GPU (mGPU) and single-GPU solution for grid-boxing in multi-spatial-agent simulations. Performance measurements were made and the two algorithms compared, and, as expected, an overwhelmingly clear advantage of mGPU over the single-GPU implementation was seen. Even though memory sacrifices were made in the process, larger systems continued to receive an excellent increase in performance. These techniques are vital for facilitating large system sizes of agents and the investigation of logarithmically scaled emergent phenomena in simulations.

In future we hope to parallelise the hash calculation and the sorting and reordering phases also. These phases are more difficult to parallelise over multiple GPUs, but doing so will greatly improve performance and allow us to reach much larger systems as well. These approaches will allow interactive-time visualisation of large scale spatial agent simulations and hence make it easier to search the parameter space of the models.

References

[1] Couchman, H.M.P.: Cosmological simulations using adaptive particle-mesh methods. Web (1994)
[2] Couchman, H.M.P.: Gravitational n-body simulation of large-scale cosmic structure. Number 13. In: The Restless Universe: Applications of Gravitational N-Body Dynamics to Planetary, Stellar and Galactic Systems. Taylor and Francis (2001) 239–264 ISBN: 978-0-7503-0822-9.
[3] Couchman, H.M.P.: Simulating the formation of large-scale cosmic structure with particle-grid methods. J. Comp. and Applied Mathematics 109 (1999) 373–406
[4] Katzenelson, J.: Computational structure of the n-body problem. Technical Report AI Memo 1042, MIT AI Lab (1988)
[5] Liewer, P.C., Decyk, V., Dawson, J., Fox, G.C.: A universal concurrent algorithm for plasma particle-in-cell simulation codes. In: Proc. Third Hypercube Conference. Number C3P-362 (1988) 1101–1107
[6] Reynolds, C.: Flocks, herds and schools: A distributed behavioral model. In: SIGGRAPH '87: Proc. 14th Annual Conf. on Computer Graphics and Interactive Techniques, ACM (1987) 25–34 ISBN 0-89791-227-6.
[7] Reynolds, C.: Boids background and update. http://www.red3d.com/cwr/boids/ (2011)
[8] Hawick, K.A., James, H.A., Scogings, C.J.: Grid-boxing for spatial simulation performance optimisation. In T. Znati, ed.: Proc. 39th Annual Simulation Symposium, Huntsville, Alabama, USA (2006) The Society for Modeling and Simulation International, Pub. IEEE Computer Society.
[9] Guo, D.: Spatial cluster ordering and encoding for high-dimensional geographic knowledge discovery. Technical report, GeoVISTA Center and Department of Geography, Pennsylvania State University (2002)
[10] Guo, D., Gehagen, M.: Spatial ordering and encoding for geographic data mining and visualization. J. Intell. Inf. Sys. 27 (2006) 243–266
[11] Appel, A.W.: An efficient program for many-body simulation. SIAM J. Sci. Stat. Comput. 6 (1985) 85–103
[12] Singer, J.K.: Parallel implementation of the fast multipole method with periodic boundary conditions. East-West J. Numer. Maths 3 (1995) 199–216
[13] Gelato, S., Chernoff, D.F., Wasserman, I.: An adaptive hierarchical particle-mesh code with isolated boundary conditions. The Astrophysical Journal 480 (1997) 115–131
[14] Barnes, J., Hut, P.: A hierarchical O(N log N) force-calculation algorithm. Nature 324 (1986) 446–449
[15] Burton, A., Field, A.J., To, H.W.: A cell-cell Barnes-Hut algorithm for fast particle simulation. Australian Computer Science Communications 20 (1998) 267–278
[16] Darve, E., Cecka, C., Takahashi, T.: The fast multipole method on parallel clusters, multicore processors, and graphics processing units. Compte Rendus Mecanique 339 (2011) 185–193
[17] Warren, M.S., Salmon, J.K.: Astrophysical n-body simulations using hierarchical tree data structures. In: Proc. ACM IEEE Conf. on Supercomputing (SC92). (1992)
[18] Samet, H.: Data structures for quadtree approximation and compression. Communications of the ACM 28 (1985) 973–993
[19] O'Connell, J., Sullivan, F., Libes, D., Orlandini, E., Tesi, M.C., Stella, A.L., Einstein, T.L.: Self-avoiding random surfaces: Monte Carlo study using oct-tree data-structure. J. Phys. A: Math. and Gen. 24 (1991) 4619–4631
[20] Warren, M.S., Salmon, J.K.: A parallel hashed oct-tree n-body algorithm. In: Supercomputing. (1993) 12–21
[21] Libes, D.: Modeling dynamic surfaces with octrees. Computers & Graphics 15 (1991) 383–387
[22] Bedorf, J., Gaburov, E., Zwart, S.P.: A sparse octree gravitational n-body code that runs entirely on the GPU processor. J. Comp. Phys. Online December (2011) 1–34
[23] Dubinski, J.: A parallel tree code. Interface 1 (1996) 1–19
[24] Gope, D., Jandhyala, V.: Oct-tree-based multilevel low-rank decomposition algorithm for rapid 3-D parasitic extraction. IEEE Trans. Computer-Aided Design of Integrated Circuits and Systems 23 (2004) 1575–1580
[25] Jackins, C.L., Tanimoto, S.L.: Oct-trees and their use in representing three-dimensional objects. Computer Graphics and Image Processing 14 (1980) 249–270
[26] Ricker, P.M.: A direct multigrid Poisson solver for oct-tree adaptive meshes. The Astrophysical Journal Supplement Series 176 (2008) 293–300
[27] Danilewski, P., Popov, S., Slusallek, P.: Binned SAH kd-tree construction on a GPU. Technical report, Saarland University, Germany (2010) 15 Pages.
[28] Yang, H.R., Kang, K.K., Kim, D.: Acceleration of massive particle data visualization based on GPU. In: Virtual and Mixed Reality, Part II, HCII 2011. Number 6774 in LNCS, Springer (2011) 425–431
[29] Zhang, Y., Liu, L., Lu, X., Li, D.: Efficient range query processing over DHTs based on the balanced Kautz tree. Concurrency and Computation: Practice and Experience 23 (2011) 796–814
[30] Kravtsov, A.V., Klypin, A.A., Khokhlov, A.M.: Adaptive refinement tree: A new high-resolution n-body code for cosmological simulations. The Astrophysical Journal Supplement Series 111 (1997) 73–94
[31] Green, S.: Particle simulation using CUDA. NVIDIA (2010)
[32] Leist, A., Playne, D., Hawick, K.: Exploiting Graphical Processing Units for Data-Parallel Scientific Applications. Concurrency and Computation: Practice and Experience 21 (2009) 2400–2437 CSTN-065.
[33] Akarsu, E., Dincer, K., Haupt, T., Fox, G.C.: Particle-in-cell simulation codes in High Performance Fortran. In: Proc. ACM/IEEE Conf. on Supercomputing (SC'96), Washington DC, USA (1996)
[34] Nakasato, N.: Implementation of a parallel tree method on a GPU. J. Comp. Science Online March (2011) 1–10
[35] Schive, H.Y., Zhang, U.H., Chiueh, T.: Directionally unsplit hydrodynamic schemes with hybrid MPI/OpenMP/GPU parallelization in AMR. Int. J. High Perf. Computing Applications Online November (2011) 1–16
[36] et al., M.: GPU octrees and optimized search. VIII Brazilian Symposium on Games and Digital Entertainment (2009) 73–76
[37] Hoberock, J., Bell, N.: Thrust: A parallel template library. http://www.meganewtons.com/ (2011)
[38] Merrill, D., Grimshaw, A.: Revisiting sorting for GPGPU stream architectures. In: PACT '10: Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques. (2010)