Analysis of Performance Gap Between OpenACC and the Native Approach on P100 GPU and SW26010: A Case Study with GTC-P

Stephen Wang†1, James Lin†1, William Tang†2, Stephane Ethier†2, Bei Wang†2, Simon See†1,3
†1 Shanghai Jiao Tong University, Center for HPC
†2 Princeton University, Institute for Computational Science & Engineering (PICSciE) and Plasma Physics Laboratory (PPPL)
†3 NVIDIA Corporation
GTC 2018, San Jose, USA
March 27, 2018

Background
• Sunway TaihuLight is now the No. 1 supercomputer on the Top500 list. In the near future, Summit at ORNL will be the next leap in leadership-class supercomputers.
  → Maintaining a single code base across different supercomputers.
• Real-world applications written with OpenACC can be portable across NVIDIA GPUs and Sunway processors. The GTC-P code serves as a case study.
  → We analyze the performance gap between the OpenACC version and the native programming approach on two different architectures.

GTC-P: Gyrokinetic Toroidal Code - Princeton
• Developed by Princeton to accelerate progress in highly scalable plasma turbulence HPC Particle-in-Cell (PIC) codes
• Modern “co-design” version of the comprehensive original GTC code, with a focus on using computer science performance modeling to improve basic PIC operations and deliver simulations at extreme scales with unprecedented resolution and speed on a variety of architectures worldwide
• Targets present-day multi-petaflop supercomputers, including Tianhe-2, Titan, Sequoia, Mira, etc., that feature GPU, multicore CPU, and many-core processors
• KEY REFERENCE: W. Tang, B. Wang, S. Ethier, G. Kwasniewski, T. Hoefler, et al., “Extreme Scale Plasma Turbulence Simulations on Top Supercomputers Worldwide”, Supercomputing (SC) 2016 Conference, Salt Lake City, Utah, USA

The case study of GTC-P code with OpenACC
• Charge: particle-to-grid interpolation (SCATTER)
• Smooth/Poisson/Field: grid work (local stencil)
• Push:
  • grid-to-particle interpolation (GATHER)
  • update position and velocity
• Shift: in a distributed-memory environment, exchange particles among processors

The case study of GTC-P code with OpenACC
• Challenges
  a. Memory-bound kernels
  b. Data hazards
  c. Random memory access
• Methodology
  a. Reduce memory bandwidth consumption
  b. Use atomic operations, or duplication and reduction
  c. Take full advantage of local memory

The performance of atomic operations on P100 and SW26010

NVIDIA GPU (P100)               CUDA                   OpenACC
Elapsed time (s)                5.9                    6.0

CUDA supports global atomics in a coalesced way by transposing in shared memory.

Sunway processor (SW26010)      Serial code on 1 MPE   OpenACC code on 64 CPEs
Elapsed time (s)                4.7                    2360.5

The OpenACC code is 504x slower, which is unacceptable. Atomic operations on SW26010 are implemented with a lock-and-unlock methodology.

Performance evaluation on NVIDIA P100
• On P100 the native atomicAdd instruction is used, instead of the compare-and-swap loop implemented with the atomicCAS instruction on K80.
• The performance gap of GTC-P between CUDA and OpenACC is narrowed by the hardware upgrade.

Implementation of the OpenACC version on SW26010
• A duplication-and-reduction algorithm is used instead of atomic operations; it is implemented with the help of the global variable acc_thread_id.
• The tile directive is used to coalesce data accesses into DMA requests and fill the 64 KB LDM.
[Figure labels: DMA, Main Memory]

Performance evaluation of the OpenACC version on SW26010
• The performance is acceptable after removing the atomic operations on SW26010.
• Taking full advantage of the DMA bandwidth is the key factor for the memory-bound kernel.
• The charge kernel is the hotspot of the OpenACC version.
[Figure: elapsed time [sec], lower is better, broken down by kernel (Charge, Push, Poisson, Field, Smooth, Shift) for Sequential (MPE), OpenACC (CPE), +Tile, +SPM library, and +w/o atomics; annotations: Baseline, 1.1X, 2.5X]

Register level communication on SW26010
• A low-latency register communication mechanism is available within the CPE cluster; it is the key factor for data locality.

The RLC optimization for the charge kernel on SW26010
[Figure: irregular memory access pattern in the charge kernel]
• The index values are preconditioned on the MPE and then transferred to the first column of the CPE cluster.
• Irregular accesses are served on the remaining CPEs by row communication.

The async optimization for the charge kernel on SW26010
• The irregular memory accesses implemented with RLC on the CPE cluster run simultaneously with the remaining part, which does not fit in the limited SPM space.
• The performance is tuned manually.

Performance tuning of the charge kernel on SW26010
[Figure annotation: 74%]
Finally, comparing the OpenACC version and the native approach on SW26010 processors, we achieved around 4X speedup.

How does the OpenACC version of the GTC-P code scale on real supercomputers? (Early Results)

Experimental results of scaling evaluation on the GPU cluster at SJTU
[Figure: weak scaling]

Experimental results of scaling evaluation on the Titan supercomputer
• One K20X per node
• “Gemini” interconnect
• Strong scaling is to be done …

Experimental results of scaling evaluation on the Sunway TaihuLight supercomputer

Summary
• The case study demonstrated the portability of OpenACC across GPUs and the Chinese home-grown many-core processor, although the algorithm has to be refactored on SW26010 compared with the GPU.
• The performance gap between the OpenACC version and the CUDA version of GTC-P on NVIDIA P100 is narrowed by the hardware upgrade.
• The experiments showed that the performance gap on SW26010 cannot be ignored, due to the lack of a high-efficiency general software cache on the CPE cluster. We designed specific register level communication to fix the problem.
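
Sketch: duplication and reduction in place of atomics
The duplication-and-reduction scheme that replaces atomic operations on SW26010 can be illustrated in plain C. This is a minimal sketch under stated assumptions, not the GTC-P implementation: the function name charge_dup_reduce, the constant NUM_WORKERS, and the unit charge deposited per particle are all hypothetical, and the outer worker loop only emulates the 64 CPEs serially (in the slides, each thread selects its private copy via acc_thread_id).

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_WORKERS 64  /* assumption: one private grid copy per CPE */

/* Scatter a unit charge from each particle to its grid cell without atomics.
 * cell[p] is the grid cell of particle p (hypothetical layout).
 * Returns 0 on success, -1 if the private copies cannot be allocated. */
int charge_dup_reduce(const int *cell, int num_particles,
                      double *grid, int num_cells)
{
    /* Duplication: every worker gets its own zero-initialised grid copy. */
    double *copies = calloc((size_t)NUM_WORKERS * (size_t)num_cells,
                            sizeof *copies);
    if (copies == NULL) {
        return -1;
    }

    /* Worker w handles particles w, w + NUM_WORKERS, ... and writes only
     * to its own copy, so no two workers ever update the same address.
     * On an accelerator this outer loop would be the parallel one. */
    for (int w = 0; w < NUM_WORKERS; ++w) {
        for (int p = w; p < num_particles; p += NUM_WORKERS) {
            assert(cell[p] >= 0 && cell[p] < num_cells);
            copies[(size_t)w * (size_t)num_cells + (size_t)cell[p]] += 1.0;
        }
    }

    /* Reduction: each grid cell sums the partial values of all workers. */
    for (int c = 0; c < num_cells; ++c) {
        double sum = 0.0;
        for (int w = 0; w < NUM_WORKERS; ++w) {
            sum += copies[(size_t)w * (size_t)num_cells + (size_t)c];
        }
        grid[c] += sum;
    }

    free(copies);
    return 0;
}
```

The trade-off is extra memory (NUM_WORKERS copies of the grid) and one additional reduction pass in exchange for removing every contended update, which matters on hardware where atomics are implemented by locking.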

Reference
• Yueming Wei, Yichao Wang, Linjin Cai, William Tang, Bei Wang, Stephane Ethier, Simon See and James Lin. Performance and Portability Studies with OpenACC Accelerated Version of GTC-P. The 17th International Conference on Parallel and Distributed Computing, Applications and Technologies, Guangzhou, China, December 16-18, 2016.
• Yichao Wang, James Lin, Linjin Cai, William Tang, Stephane Ethier, Bei Wang, Simon See and Satoshi Matsuoka. Porting and Optimizing GTC-P on TaihuLight Supercomputer with Sunway OpenACC. Journal of Computer Research and Development, 2018, 55(4).
