Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor

Stephen Wang†1, James Lin†1,4, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3 and Satoshi Matsuoka†4

†1 Shanghai Jiao Tong University, Center for HPC
†2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
†3 NVIDIA Corporation
†4 Tokyo Institute of Technology

GTC 2017, San Jose, USA, May 11, 2017

1 Challenges of supporting multi- and many-core processors, the territory of OpenMP

[Chart: core counts (10–1000) of multi- and many-core processors across architectures]

2 GTC-P: Gyrokinetic Toroidal Code – Princeton
• Developed by Princeton to accelerate progress in highly scalable plasma-turbulence HPC Particle-in-Cell (PIC) codes
• Successfully applied to high-resolution, problem-size-scaling studies relevant to fusion's next-generation International Thermonuclear Experimental Reactor (ITER)
• A modern "co-design" version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scale with unprecedented resolution and speed on a variety of architectures worldwide
• Runs on present-day multi-petaflop systems, including Tianhe-2, Titan, Sequoia, and Mira, which feature GPU, multi-core CPU, and many-core processors
• KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler, et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing (SC16), Salt Lake City, Utah, USA, 2016

3 OpenACC Implementations – hotspots
• Challenges
  a. Memory-bound kernels
  b. Data hazards
  c. Random memory access

• Implementations
  a. Increase memory bandwidth
  b. Use atomic operations
  c. Take advantage of local memory

[Figure: the six major subroutines of GTC-P]

4 OpenACC Implementations – present directive

5 OpenACC Implementations – atomic directive

6 Running the single OpenACC code base: huge performance gaps on x86 and Sunway

GPU (NVIDIA K20), elapsed time (s):
  Baseline (CUDA): 7.9
  OpenACC: 16.7 → 2x slower

x86 multi-core (SNB), elapsed time (s):
  Baseline (OpenMP): 7.9
  OpenACC: 1572.8 → 201x slower!

OpenMP allocates a private copy of the array on each thread and reduces them at the end, so no atomic operations are needed.

Sunway many-core (SW26010), elapsed time (s):
  Baseline (serial code on 1 MPE): 4.7
  OpenACC code on 64 CPEs: 2360.5 → 504x slower!!! unacceptable

7 Our solution for multi- and many-core: use the thread ID to create duplicate array copies for the reduction, replacing the fetch-and-add atomic operation

[Diagram: instead of threads T1–T4 performing irregular fetch-and-add accesses on the shared array[n] (a data hazard), each thread accumulates into its own private copy array[thread_id][n]; the copies are then added together in a final reduction.]

8 Performance w/o atomic operations on x86 CPU

• The OpenACC standard does not yet expose a thread ID on x86.

• Since the standard lacks it, a private function of the PGI compiler (16.10) is used here: __pgi_blockidx()

[Chart: elapsed time of the baseline vs. the version without atomic operations, PGI compiler 16.10]

9 Implementation on Sunway many-core processor: a customized thread-id extension available from Sunway OpenACC

[Figure: architecture overview of the SW26010 processor]

• acc_thread_id is a customized extension provided by Sunway OpenACC.

10 Optimization on Sunway many-core processor: data locality in the 64 KB Scratch Pad Memory (SPM)
• Use the tile directive to coalesce data accesses into one DMA request per tile. The optimal tile size makes full use of the 64 KB SPM.

[Chart: elapsed time (s) vs. tile_size; lower is better]

[Figure: memory hierarchy of a CPE]
• Keep data in the SPM instead of accessing global memory.

11 (*) Optimization on Sunway many-core processor

• 256-bit SIMD intrinsics

[Diagram: compilation flow on Sunway — OpenACC code → swacc (source-to-source compiler) → intermediate code (.host and .slave files) → sw5cc (native compiler) → executable. The "-keep" or "-dumpcommand" option makes the compiler keep the intermediate code.]

• This part of the push kernel achieves a 5.6x speedup, but its cost is too small compared with the entire GTC-P code.

12 Performance on Sunway many-core processor

• Avoid atomic operations
• Increase DMA bandwidth
• Strengthen data locality in SPM
• (*) Built-in SIMD code

[Chart: elapsed time [sec] (0–2500, lower is better), broken down by kernel (Shift, Smooth, Field, Poisson, Push, Charge) for Sequential (MPE), OpenACC (CPE), +Tile, +SPM library, and +w/o atomics; annotations mark 1.1x and 2.5x improvements over the baseline]

13 Performance and portability of GTC-P on GPU

14 Use native atomic instructions on P100

• Native atomic instructions (FP64) are supported on the Pascal architecture.

• Compare the PTX code generated by the PGI 16.10 compiler on K80 and P100.

15 OpenACC version of GTC-P on K80 and P100

• Performance of the OpenACC version on P100 is close to the CUDA code thanks to better atomic-instruction support.
• OpenACC benefits from the hardware support in the latest GPU architecture.

16 Use a GPU-specific algorithm in the OpenACC code

• Remove the auxiliary array that was used to store the 4 points

17 Performance results of the OpenACC version with the new algorithm on GPU

Device memory usage per GPU on Tesla K40 (problem size B100):

            CUDA          OpenACC       new OpenACC
  2 × GPU   1399 MB/GPU   3070 MB/GPU   1501 MB/GPU
  4 × GPU   742 MB/GPU    1569 MB/GPU   785 MB/GPU

→ 50% reduction in device memory usage

18 [Chart: core counts (10–1000) across architectures, annotated with "hardware support for key operations" and "gap of memory hierarchy"]

19 Summary

• Architecture-specific optimizations are necessary to reach reasonable performance with the GTC-P code.

• Native atomic support on the GPU now lets the OpenACC code outperform the same operations on multi- and many-core processors.

• The gap in memory hierarchy between architectures may require different algorithms in the OpenACC code.

20 Reference

• Stephen Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. “Porting and Optimizing GTC-P on TaihuLight with Sunway OpenACC.” HPC China, 2016. Best Paper Award (Acceptance Rate < 3%)

• Yueming Wei, Stephen Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. “Performance and Portability Studies with OpenACC Accelerated Version of GTC-P.” PDCAT, 2016.
