CROSS-PLATFORM PERFORMANCE PORTABILITY WITH OPENACC
Michael Wolfe, PGI Compilers & Tools

Performance Portability

§ What is it?
§ Why is it important?
§ Why is it harder today than in the last millennium?

Supported Platforms

§ Compilers — pgfortran, pgcc, pgc++
§ Host CPUs — 32-bit and 64-bit Intel/AMD x86 hosts
§ Operating Systems — Linux, Windows
§ Accelerators
  — NVIDIA Tesla: Tesla (cc1.x), Fermi (cc2.x), Kepler (cc3.x)
  — AMD Radeon: Tahiti (HD 7900), Cape Verde (HD 7700), Spectre (Kaveri APU)
  — Future plans include: multicore host, Xeon Phi

NVIDIA Kepler Overall Block Diagram*

* From the whitepaper “NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110”, © 2012 NVIDIA Corporation.

NVIDIA Kepler SMX Block Diagram*

§ 192 SP CUDA cores
§ 64 DP units
§ 32 SFUs
§ 32 ld/st units
§ 16 texture units

* From the whitepaper “NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110”, © 2012 NVIDIA Corporation.

AMD Radeon 7970 Block Diagram*

* From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.

AMD Radeon 7970 Compute Unit*

* From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.

Architecture Model

Selecting the Target Accelerator

§ -ta=tesla:[teslaoptions]
  — cc1x, cc2x, cc3x, cc35, cc1+, cc2+, cc3+
  — maxregcount:n, [no]fma, fastmath, [no]rdc
§ -ta=radeon:[radeonoptions]
  — tahiti, capeverde, spectre
  — buffercount:n
§ -ta=host
§ -ta=tesla,radeon,host

Target Accelerator Differences
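These options combine on the compiler command line; a sketch of typical invocations, assuming the PGI 14.x toolchain is installed (file names are illustrative):

```shell
# Target a single accelerator:
pgcc -fast -ta=tesla:cc35 -o saxpy_k saxpy.c        # Kepler (cc3.5) only
pgcc -fast -ta=radeon:tahiti -o saxpy_r saxpy.c     # Radeon Tahiti only

# Target several at once: one binary with Tesla, Radeon, and host
# code paths, selected at run time.
pgcc -fast -ta=tesla,radeon,host -o saxpy saxpy.c
```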

§ Tesla
  — Planner for Tesla
  — PGI-generated code + CUDA toolkit and runtime
  — PGI OpenACC runtime
§ Radeon
  — Planner for Radeon
  — PGI-generated code + AMD OpenCL toolkit and runtime
  — Uses AMD OpenCL extensions
  — PGI OpenACC runtime

OpenACC Compiler/Runtime Architecture

[Block diagram, flattened:]

Compile time: target-independent analysis → Tesla planner / Radeon planner → device code optimizations → CUDA toolkit / OpenCL toolkit → embedded device code

Run time: compiler-generated runtime interface → OpenACC runtime → CUDA interface / OpenCL interface → CUDA Driver / OpenCL library

Ample Parallelism

§ Nested parallel loops
§ Parallel outer loops, vector (stride-1) inner loops
§ Loop collapsing
§ Asynchronous operations when possible
§ Minimize data transfers

Test Platforms

§ Intel Sandy Bridge Core i7-3930K, 3.2GHz, 12MB L3$, 6 cores
§ NVIDIA Tesla K40c, 12GB memory, 15 SMX, 875MHz
§ AMD Radeon S10000, 3GB memory, 32 CUs, 925MHz
§ PGI 14.3 compilers: -fast -ta=tesla:cc35 -ta=radeon:tahiti
§ SPECAccel OpenACC suite, ref dataset

Performance Results

Benchmark    NVIDIA Kepler K40    AMD Radeon HD 7970
bt           75.7                 72.8
cg           144                  133
cloverleaf   145                  182
csp          143                  78.6
ep           330                  281
ilbdc        97.3                 325
md           90.7                 97.2
miniGhost    132                  249
olbm         194                  112
omriq        354                  415
ostencil     45.8                 53.7
palm         184                  216
seismic      133                  95.7
sp           118                  96.4
swim         88.4                 121

Target-specific Tuning

§ #pragma acc ... device_type(nvidia) ... device_type(radeon) ...
  — vector_length, num_workers, num_gangs
§ if (acc_on_device(acc_device_nvidia)) ...
  if (acc_on_device(acc_device_radeon)) ...
§ #ifdef

Performance Portability with OpenACC

§ Single source, multiple accelerator targets § Single binary, multiple accelerator targets § Performance portability promise, delivered!