Direct N-Body Kernels for Multicore Platforms
Nitin Arora
School of Aerospace Engineering
Georgia Institute of Technology
Atlanta, Georgia 30332-0150
[email protected]

Aashay Shringarpure
School of Computer Science
Georgia Institute of Technology
Atlanta, Georgia 30332-0765
[email protected]

Richard W. Vuduc
Computational Science and Engineering Division
Georgia Institute of Technology
Atlanta, Georgia 30332-0765
[email protected]

Abstract—We present an inter-architectural comparison of single- and double-precision direct n-body implementations on modern multicore platforms, including those based on the Intel Nehalem and AMD Barcelona systems, the Sony-Toshiba-IBM PowerXCell/8i processor, and NVIDIA Tesla C870 and C1060 GPU systems. We compare our implementations across platforms on a variety of proxy measures, including performance, coding complexity, and energy efficiency.

I. INTRODUCTION

The goal of this study is to understand the differences in the implementation techniques required for direct (particle-particle) n-body simulations to achieve good performance on a variety of modern multicore CPU and accelerator desktop platforms. These platforms include an 8-core (dual-socket quad-core) 2-way/core multithreaded Intel Nehalem processor-based system (16 effective threads); a 16-core (quad-socket quad-core) AMD Barcelona-based system; an IBM dual-socket QS22 blade based on the Sony-Toshiba-IBM PowerXCell/8i processor; and NVIDIA Tesla C870 and C1060 graphics processing unit (GPU) systems.

We began the study with a focus on direct n-body computations primarily because of our interest in this application domain. The structure of the computation is a fundamental building-block in larger applications [1], [2] as well as approximate hierarchical tree-based algorithms for larger systems, e.g., Barnes-Hut or fast multipole method (FMM) [3]–[5]. Moreover, lessons-learned in implementing these kernels for physics applications readily extend to new application domains in statistical data analysis, search, and mining [6].

We initially believed it would be simple to achieve near-peak performance on these platforms. The key computational bottleneck is the O(n²) evaluation of pairwise interaction forces among a system of n particles. This kernel is highly regular and floating-point (flop) intensive, and therefore well-suited to accelerators like GPUs or PowerXCell/8i. However, to our surprise, each implementation required more significant tuning effort than expected. We describe these implementations and some “lessons-learned” in this paper. As far as we know, ours is among the first comprehensive cross-platform multicore studies for computations in this domain, and in particular unique in its contrasting of GPU and Cell-based accelerated systems.

Given that we consider only the O(n²) direct evaluation, our study will be limited in several ways. First, we achieve and highlight the best performance at large values of n. Small values of n, which are of great interest in, for instance, hierarchical tree-based approximation n-body algorithms, may require a different emphasis on low-level tuning techniques. Secondly, the computation is compute intensive with largely regular access patterns, and so we do not stress the memory system. Still, we believe this study can make a useful contribution to other computations more generally. In the numerical domain, hardware reciprocal square root is typically fast for single precision but not double, making our performance far from peak even though the computations are relatively compute-bound. Furthermore, for our accelerator architectures, we still need to carefully manage the local store to make effective use of available memory system resources. This points to non-trivial implementation trade-off considerations across multicore architectures, in further support of recent work [7].

II. RELATED WORK

Parallelization of direct n-body problems is well-studied [8], [9], and has a long history that includes custom hardware (e.g., GRAPE systems). Most recently, there has been one thorough study for x86-based CPU tuning [10], as well as several studies on GPUs both prior to [11], [12] and following [13], [14] the introduction of the high-level CUDA model, which we use in the present study. The post-CUDA direct n-body GPU applications show particularly impressive performance in comparison to existing custom hardware used for n-body simulations, like the GRAPE-6AF. These prior GPU studies focus primarily on single-precision kernels, as that was the only precision available in hardware at the time. To our knowledge there have not been any published n-body implementation performance studies for the STI Cell/B.E. processor family. This work tries to fill these comparison gaps by including both GPUs and STI Cell/B.E. platforms, as well as multicore CPU systems.
III. DIRECT N-BODY IMPLEMENTATIONS

We implemented various parallel versions of a simple n-body gravitational simulation, using direct (particle-particle) O(n²) force evaluation. In particular, we numerically integrate the equations of motion for each particle i,

    \vec{F}_i = -G m_i \sum_{1 \le j \le n,\; j \ne i} m_j \frac{\vec{r}_i - \vec{r}_j}{\lVert \vec{r}_i - \vec{r}_j \rVert^3}    (1)

where particle i has mass m_i, is located at the 3-D position r_i, and experiences a force F_i from all other particles; G is the universal gravitational constant, which we normalize to 1. We use the Verlet algorithm for our numerical integration scheme.

A. Characteristics, costs, and parallelization

The bottleneck is the O(n²) force evaluation, which computes the acceleration for all bodies i (with G = 1):

    \vec{a}_i \approx \sum_{1 \le j \le n} m_j \frac{\vec{r}_i - \vec{r}_j}{\left( \lVert \vec{r}_i - \vec{r}_j \rVert^2 + \epsilon^2 \right)^{3/2}}    (2)

Here, we adopt the common convention of a “softening parameter” term, ε², in the denominator, to improve the overall stability of the numerical integration scheme [2]. Computing the acceleration dominates the overall simulation cost, so that in this paper we focus on its parallelization and tuning for accelerators, and may effectively ignore the cost of the numerical integrator.

The scalar pseudocode for Equ. (2) can be written as follows (comments prefixed by “//”):

 1: for all bodies i do
 2:   // Load (x_i, y_i, z_i) here
 3:   a_i ≡ (a_x, a_y, a_z) ← (0, 0, 0)   // Init acceleration
 4:   for all bodies j ≠ i do
 5:     // Load (x_j, y_j, z_j) here
 6:     (Δx, Δy, Δz) ← (x_i − x_j, y_i − y_j, z_i − z_j)
 7:     γ ← (Δx)² + (Δy)² + (Δz)² + ε²
 8:     s ← m_j / (γ · √γ)   // γ^(3/2) = γ·√γ
 9:     (a_x, a_y, a_z) ← (a_x + s·Δx, a_y + s·Δy, a_z + s·Δz)
10:   end for
11:   // Store acceleration
12:   a_i ← (a_x, a_y, a_z)
13: end for

Lines 5–9 compute a single pairwise interaction. As written, there are 18 floating-point operations: 3 subtractions (line 6), 3 multiplies and 3 adds (line 7), 1 multiply, 1 square root, and 1 divide (line 8), and 3 multiplies and 3 adds (line 9). For consistency in comparing flop-rates to other papers, we consider the “cost” of lines 5–8 to be 20 flops, i.e., we count 20 flops per pairwise interaction, or 20n(n − 1) flops total. There are 4n² + 6n loads and stores, with the dominant term coming from the loads of the 3 coordinates on lines 5–6 and the mass on line 8. However, there is a significant amount of reuse, so that with appropriate cache blocking we can incur close to the minimum of 4n compulsory misses, so that the overall computation should be compute-bound with a computational intensity of ≈ 20n²/4n = 5n flops per word.

[Fig. 1. Computation and data access pattern for direct force evaluation (20 flops / interaction).]

For all platforms, we adopt the standard approach to parallelization that exploits both coarse-grained parallelization of the outermost i loop, followed by finer-grained data-parallelism, with the resulting computation and data access pattern as shown in Figure 1. The n 3-D points appear as the n × 3 matrix on the left. The i loop iterates over these points (rows); for each i, we stream over the same points j, shown mirrored (transposed) along the top, and accumulate an acceleration vector a_i. The simplest coarse-grained parallelization is an owner-computes approach that partitions the i loop into n/p chunks among p threads. This approach requires no synchronization on writes to the final acceleration matrix.
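To make the mapping from the scalar pseudocode to a concrete kernel explicit, the sketch below shows one way the loop nest and the owner-computes parallelization could be written in C with OpenMP. It is an illustrative sketch, not the code evaluated in the paper; the structure-of-arrays layout and the names (nbody_accel, eps2, and the array arguments) are our own assumptions.

#include <math.h>

/* Illustrative C/OpenMP sketch of the scalar kernel above (not the paper's code).
 * Structure-of-arrays layout: x, y, z hold coordinates, m holds masses, and
 * ax, ay, az receive the accelerations. eps2 is the softening term ε². */
void nbody_accel(int n,
                 const double *x, const double *y, const double *z,
                 const double *m, double eps2,
                 double *ax, double *ay, double *az)
{
    /* Owner-computes parallelization of the outer i loop: each thread owns a
       chunk of bodies, so all writes to ax/ay/az are independent. */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < n; i++) {
        double axi = 0.0, ayi = 0.0, azi = 0.0;
        for (int j = 0; j < n; j++) {
            if (j == i) continue;                        /* line 4: skip self-interaction */
            double dx = x[i] - x[j];                     /* line 6 */
            double dy = y[i] - y[j];
            double dz = z[i] - z[j];
            double gamma = dx*dx + dy*dy + dz*dz + eps2; /* line 7 */
            double s = m[j] / (gamma * sqrt(gamma));     /* line 8: m_j / gamma^(3/2) */
            axi += s * dx;                               /* line 9 */
            ayi += s * dy;
            azi += s * dz;
        }
        ax[i] = axi;                                     /* lines 11-12: store acceleration */
        ay[i] = ayi;
        az[i] = azi;
    }
}

Because each thread writes only the ax/ay/az entries of the bodies it owns, no synchronization is needed beyond the implicit barrier at the end of the parallel loop.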
B. x86 Implementations

For the general-purpose multicore platforms (the 2×4-core×2-threads/core = 16-thread Intel Nehalem and 4×4 = 16-core AMD Barcelona systems), we consider several “standard” optimization techniques. These techniques are well-known but essential to implement if we wish to fairly evaluate the accelerator-based GPU and PowerXCell/8i platforms.

First, we exploit coarse-grained owner-computes parallelization at the outermost loop of the interaction calculation. Each thread computes the forces on a subset of n/p particles. In this decomposition, all writes are independent.

We consider both OpenMP and Pthreads programming models, though we expect that OpenMP parallelization will be sufficient since the kernel is largely compute-bound. In the case of Pthreads, we create a team of threads at the beginning of program execution. These threads busy-wait at a barrier for a signal from the master thread to begin the interaction calculation. Upon completion, the threads again meet at a barrier. (Reported performance includes the time of these barriers.)

Secondly, we apply cache-level blocking of both loops. Since the memory footprint scales like O(n) while flops scale as O(n²), we do not expect this technique to provide much benefit until n becomes very large, since (a) typical L1 latencies for data that resides in L2 are relatively small, and (b) 2 MB or larger L2 and L3 caches today can easily accommodate n ≈ 100,000 points.

Thirdly, we manually combine data alignment, unroll-and-jam, and SIMD vectorization transformations to improve register usage and exploit fine-grained data parallelism.
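As a hedged illustration of this third class of optimizations (a sketch under assumptions, not the authors' implementation), the single-precision inner loop for one body i could be SIMD-vectorized with SSE intrinsics, using the hardware reciprocal square root plus one Newton-Raphson refinement step. The structure-of-arrays layout, 16-byte-aligned arrays, n divisible by 4, and the helper name accel_one_body_sse are all assumptions of the sketch.

#include <xmmintrin.h>   /* SSE intrinsics */

/* Illustrative SSE sketch: accumulate the single-precision acceleration on
 * body i over all n bodies, four j-bodies per iteration. Assumes 16-byte-
 * aligned arrays, n % 4 == 0, and eps2 > 0 (so the i == j term contributes
 * exactly zero and needs no branch). out_axyz receives (ax, ay, az). */
static void accel_one_body_sse(int i, int n,
                               const float *xs, const float *ys,
                               const float *zs, const float *ms,
                               float eps2, float *out_axyz)
{
    const __m128 xi = _mm_set1_ps(xs[i]);
    const __m128 yi = _mm_set1_ps(ys[i]);
    const __m128 zi = _mm_set1_ps(zs[i]);
    const __m128 e2 = _mm_set1_ps(eps2);
    const __m128 half = _mm_set1_ps(0.5f), three = _mm_set1_ps(3.0f);
    __m128 ax = _mm_setzero_ps(), ay = _mm_setzero_ps(), az = _mm_setzero_ps();

    for (int j = 0; j < n; j += 4) {
        __m128 dx = _mm_sub_ps(xi, _mm_load_ps(&xs[j]));
        __m128 dy = _mm_sub_ps(yi, _mm_load_ps(&ys[j]));
        __m128 dz = _mm_sub_ps(zi, _mm_load_ps(&zs[j]));
        /* gamma = dx^2 + dy^2 + dz^2 + eps^2 (pseudocode line 7) */
        __m128 g = _mm_add_ps(_mm_add_ps(_mm_mul_ps(dx, dx), _mm_mul_ps(dy, dy)),
                              _mm_add_ps(_mm_mul_ps(dz, dz), e2));
        /* Hardware reciprocal square root, refined with one Newton-Raphson step:
           r <- 0.5 * r * (3 - g * r * r) */
        __m128 r = _mm_rsqrt_ps(g);
        r = _mm_mul_ps(_mm_mul_ps(half, r),
                       _mm_sub_ps(three, _mm_mul_ps(g, _mm_mul_ps(r, r))));
        /* s = m_j / gamma^(3/2) = m_j * r^3 (pseudocode line 8) */
        __m128 s = _mm_mul_ps(_mm_load_ps(&ms[j]),
                              _mm_mul_ps(_mm_mul_ps(r, r), r));
        ax = _mm_add_ps(ax, _mm_mul_ps(s, dx));
        ay = _mm_add_ps(ay, _mm_mul_ps(s, dy));
        az = _mm_add_ps(az, _mm_mul_ps(s, dz));
    }

    /* Horizontal reduction of the four partial sums in each register. */
    float t[4];
    _mm_storeu_ps(t, ax); out_axyz[0] = t[0] + t[1] + t[2] + t[3];
    _mm_storeu_ps(t, ay); out_axyz[1] = t[0] + t[1] + t[2] + t[3];
    _mm_storeu_ps(t, az); out_axyz[2] = t[0] + t[1] + t[2] + t[3];
}

Unroll-and-jam would additionally process two or more i-bodies per sweep over j to improve register reuse, and a double-precision variant would use the SSE2 _pd intrinsics, which lack a hardware reciprocal square root, consistent with the single- versus double-precision gap noted in the introduction.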