CROSS-PLATFORM PERFORMANCE PORTABILITY WITH OPENACC
Michael Wolfe, PGI Compilers & Tools

Performance Portability
§ What is it?
§ Why is it important?
§ Why is it harder today than in the last millennium?
Supported Platforms
§ Compilers — pgfortran, pgcc, pgc++
§ Host CPUs — 32-bit and 64-bit Intel/AMD x86 hosts
§ Operating Systems — Linux, Windows
§ Accelerators
  — NVIDIA Tesla: Tesla (cc1.x), Fermi (cc2.x), Kepler (cc3.x)
  — AMD Radeon: Tahiti (HD 7900), Cape Verde (HD 7700), Spectre (Kaveri APU)
  — Future plans include: multicore host, Xeon Phi

NVIDIA Kepler Overall Block Diagram*
* From the whitepaper “NVIDIA’s Next Generation CUDA™ Compute Architecture: Kepler™ GK110”, © 2012 NVIDIA Corporation.

NVIDIA Kepler SMX Block Diagram*
§ 192 SP CUDA cores
§ 64 DP units
§ 32 SFUs
§ 32 ld/st units
§ 16 texture units
AMD Radeon 7970 Block Diagram*
* From “AMD Accelerated Parallel Processing – OpenCL Programming Guide”, © 2012 Advanced Micro Devices, Inc.

AMD Radeon 7970 Compute Unit*
Architecture Model

Selecting Target Accelerator
§ -ta=tesla:[teslaoptions]
  cc1x cc2x cc3x cc35 cc1+ cc2+ cc3+ maxregcount:n [no]fma fastmath [no]rdc
§ -ta=radeon:[radeonoptions]
  tahiti capeverde spectre buffercount:n
§ -ta=host
§ -ta=tesla,radeon,host
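As a sketch of how these flags combine on the pgcc command line (the source and binary file names are illustrative; -Minfo=accel and the ACC_DEVICE_TYPE environment variable are the usual PGI/OpenACC ways to inspect code generation and override device selection):

```shell
# Build separate binaries, one per target:
pgcc -fast -ta=tesla:cc35 -Minfo=accel jacobi.c -o jacobi_tesla
pgcc -fast -ta=radeon:tahiti -Minfo=accel jacobi.c -o jacobi_radeon

# Or build one unified binary carrying device code for all three
# targets; the OpenACC runtime picks a device at startup, and
# ACC_DEVICE_TYPE can override that choice:
pgcc -fast -ta=tesla,radeon,host jacobi.c -o jacobi
ACC_DEVICE_TYPE=host ./jacobi
```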
Target Accelerator Differences

§ Tesla
  — Planner for Tesla
  — PGI-generated code + CUDA toolkit and runtime
  — PGI OpenACC runtime
§ Radeon
  — Planner for Radeon
  — PGI-generated code + AMD OpenCL toolkit and runtime
  — uses AMD OpenCL extensions
  — PGI OpenACC runtime

OpenACC Compiler/Runtime Architecture
[Diagram: target-independent analysis → Tesla planner / Radeon planner → device code optimizations → CUDA toolkit / OpenCL toolkit → embedded device code → compiler-generated runtime interface → OpenACC runtime → CUDA interface / OpenCL interface → CUDA driver / OpenCL library]

Ample Parallelism
§ Nested parallel loops
§ Parallel outer loops, vector (stride-1) inner loops
§ Loop collapsing
§ Asynchronous operations when possible
§ Minimize data transfers
Test Platforms

§ Intel Sandy Bridge Core i7-3930K, 3.2GHz, 12MB L3$, 6 cores
§ NVIDIA Tesla K40c, 12GB memory, 15 SMX units, 875MHz
§ AMD Radeon S10000, 3GB memory, 32 CUs, 925MHz
§ PGI 14.3 compilers: -fast -ta=tesla:cc35 -ta=radeon:tahiti
§ SPECAccel OpenACC suite, ref dataset

Performance Results
Benchmark    NVIDIA Tesla K40   AMD Radeon HD 7970
bt           75.7               72.8
cg           144                133
cloverleaf   145                182
csp          143                78.6
ep           330                281
ilbdc        97.3               325
md           90.7               97.2
miniGhost    132                249
olbm         194                112
omriq        354                415
ostencil     45.8               53.7
palm         184                216
seismic      133                95.7
sp           118                96.4
swim         88.4               121

Target-specific Tuning
§ #pragma acc ... device_type(nvidia) ... device_type(radeon) ...
  — vector_length, num_workers, num_gangs
§ if( acc_on_device( acc_device_nvidia ) ) ...
  if( acc_on_device( acc_device_radeon ) ) ...
§ #ifdef
Performance Portability with OpenACC

§ Single source, multiple accelerator targets
§ Single binary, multiple accelerator targets
§ Performance portability promise, delivered!