Achieving Portable Performance for GTC-P with OpenACC on GPU, multi-core CPU, and Sunway Many-core Processor

Stephen Wang†1, James Lin†1,4, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3 and Satoshi Matsuoka†4

†1 Shanghai Jiao Tong University, Center for HPC
†2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
†3 NVIDIA Corporation
†4 Tokyo Institute of Technology

GTC 2017, San Jose, USA, May 11, 2017

1 Challenges of supporting multi- and many-core processors, the territory of OpenMP

[Chart: core counts (10–1000) of multi- and many-core processors across architectures]

2 GTC-P: Gyrokinetic Toroidal Code – Princeton
• Developed by Princeton to accelerate progress in highly scalable plasma-turbulence HPC Particle-in-Cell (PIC) codes
• Successfully applied to high-resolution, problem-size-scaling studies relevant to fusion's next-generation International Thermonuclear Experimental Reactor (ITER)
• A modern "co-design" version of the comprehensive original GTC code, focused on using computer-science performance modeling to improve basic PIC operations and deliver simulations at extreme scale with unprecedented resolution and speed on a variety of architectures worldwide
• Runs on present-day multi-petaflop systems, including Tianhe-2, Titan, Sequoia, and Mira, which feature GPU, multi-core CPU, and many-core processors
• KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler, et al., "Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide", Supercomputing (SC16), Salt Lake City, Utah, USA, 2016

3 OpenACC Implementations – hotspots
• Challenges
  a. Memory-bound kernels
  b. Data hazards
  c. Random memory access

• Implementations
  a. Increase memory bandwidth
  b. Use atomic operations
  c. Take advantage of local memory

[Figure: the six major subroutines of GTC-P]

4 OpenACC Implementations – present directive

5 OpenACC Implementations – atomic directive

6 Running the single OpenACC code base: huge performance gaps on x86 and Sunway

GPU (NVIDIA K20), elapsed time (s):
  Baseline (CUDA): 7.9
  OpenACC: 16.7 → 2x slower

x86 multi-core (SNB), elapsed time (s):
  Baseline (OpenMP): 7.9
  OpenACC: 1572.8 → 201x slower!

OpenMP allocates a private copy of the array on each thread and reduces them at the end, so no atomic operations are needed.

Sunway many-core (SW26010), elapsed time (s):
  Baseline (serial code on 1 MPE): 4.7
  OpenACC code on 64 CPEs: 2360.5 → 504x slower!!! unacceptable

7 Our solution for multi- and many-core: use the thread ID to create duplicate array copies for the reduction, replacing the fetch-and-add atomic operation

[Diagram: instead of threads T1–T4 performing irregular fetch-and-add accesses on the shared array[n] (a data hazard), each thread accumulates into its own private copy array[thread_id][n]; the copies are then added together in a final reduction.]

8 Performance w/o atomic operations on x86 CPU

• The OpenACC standard does not yet expose a thread ID on x86.

• Since the standard lacks it, a private function of the PGI compiler (16.10) is used here: __pgi_blockidx()

[Chart: elapsed time of the baseline vs. the version without atomic operations, PGI compiler 16.10]

9 Implementation on Sunway many-core processor: a customized thread-id extension available from Sunway OpenACC

[Figure: architecture overview of the SW26010 processor]

• acc_thread_id is a customized extension provided by Sunway OpenACC.

10 Optimization on Sunway many-core processor: data locality in the 64 KB Scratch Pad Memory (SPM)
• Use the tile directive to coalesce data accesses into one DMA request per tile. The optimal tile size makes full use of the 64 KB SPM.

[Chart: elapsed time (s) vs. tile_size; lower is better]

[Figure: memory hierarchy of a CPE]
• Keep data in the SPM instead of accessing global memory.

11 (*) Optimization on Sunway many-core processor

• 256-bit SIMD intrinsics

[Diagram: compilation flow on Sunway — OpenACC code → swacc (source-to-source compiler) → intermediate code (.host and .slave files) → sw5cc (native compiler) → executable. The "-keep" or "-dumpcommand" option makes the compiler keep the intermediate code.]

• This part of the push kernel achieves a 5.6x speedup, but its cost is too small compared with the entire GTC-P code.

12 Performance on Sunway many-core processor

• Avoid atomic operations
• Increase DMA bandwidth
• Strengthen data locality in SPM
• (*) Built-in SIMD code

[Chart: elapsed time [sec] (0–2500, lower is better), broken down by kernel (Shift, Smooth, Field, Poisson, Push, Charge) for Sequential (MPE), OpenACC (CPE), +Tile, +SPM library, and +w/o atomics; annotations mark 1.1x and 2.5x improvements over the baseline]

13 Performance and portability of GTC-P on GPU

14 Use native atomic instructions on P100

• Native atomic instructions (FP64) are supported on the Pascal architecture.

• Compare the PTX code generated by the PGI 16.10 compiler on K80 and P100.

15 OpenACC version of GTC-P on K80 and P100

• Performance of the OpenACC version on P100 is close to the CUDA code thanks to better atomic-instruction support.
• OpenACC benefits from the hardware support in the latest GPU architecture.

16 Use a GPU-specific algorithm in the OpenACC code

• Remove the auxiliary array that was used to store the 4 points

17 Performance results of the OpenACC version with the new algorithm on GPU

Device memory usage per GPU on Tesla K40 (problem size B100):

            CUDA          OpenACC       new OpenACC
  2 × GPU   1399 MB/GPU   3070 MB/GPU   1501 MB/GPU
  4 × GPU   742 MB/GPU    1569 MB/GPU   785 MB/GPU

→ 50% reduction in device memory usage

18 [Chart: core counts (10–1000) across architectures, annotated with "hardware support for key operations" and "gap of memory hierarchy"]

19 Summary

• Architecture-specific optimizations are necessary to reach reasonable performance with the GTC-P code.

• Native atomic support on the GPU now lets the OpenACC code outperform the same operations on multi- and many-core processors.

• The gap in memory hierarchy between architectures may require different algorithms in the OpenACC code.

20 Reference

• Stephen Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. “Porting and Optimizing GTC-P on TaihuLight with Sunway OpenACC.” HPC China, 2016. Best Paper Award (Acceptance Rate < 3%)

• Yueming Wei, Stephen Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. “Performance and Portability Studies with OpenACC Accelerated Version of GTC-P.” PDCAT, 2016.
