Cross-Platform Performance Portability Using OpenACC

CROSS PLATFORM PERFORMANCE PORTABILITY WITH OPENACC
Michael Wolfe, PGI Compilers & Tools

Performance Portability
§ What is it?
§ Why is it important?
§ Why is it harder today than in the last millennium?

Supported Platforms
§ Compilers
  - pgfortran, pgcc, pgc++
§ Host CPUs
  - 32-bit and 64-bit Intel/AMD x86 hosts
§ Operating Systems
  - Linux, Windows
§ Accelerators
  - NVIDIA Tesla: Tesla (cc1.x), Fermi (cc2.x), Kepler (cc3.x)
  - AMD Radeon: Tahiti (HD 7900), Cape Verde (HD 7700), Spectre (Kaveri APU)
  - Future plans include: multicore host, Xeon Phi

NVIDIA Kepler Overall Block Diagram*
[Figure: block diagram]
* From the whitepaper "NVIDIA's Next Generation CUDA(TM) Compute Architecture: Kepler(TM) GK110", © 2012 NVIDIA Corporation.

NVIDIA Kepler SMX Block Diagram*
§ 192 SP CUDA cores
§ 64 DP units
§ 32 SFUs
§ 32 ld/st units
* From the whitepaper "NVIDIA's Next Generation CUDA(TM) Compute Architecture: Kepler(TM) GK110", © 2012 NVIDIA Corporation.

AMD Radeon 7970 Block Diagram*
[Figure: block diagram]
* From "AMD Accelerated Parallel Processing - OpenCL Programming Guide", © 2012 Advanced Micro Devices, Inc.

AMD Radeon 7970 Compute Unit*
[Figure: compute unit diagram]
* From "AMD Accelerated Parallel Processing - OpenCL Programming Guide", © 2012 Advanced Micro Devices, Inc.
Architecture Model
[Figure: architecture model diagram]

Selecting target accelerator
§ -ta=tesla:[teslaoptions]
    cc1x cc2x cc3x cc35 cc1+ cc2+ cc3+
    maxregcount:n [no]fma fastmath [no]rdc
§ -ta=radeon:[radeonoptions]
    tahiti capeverde spectre buffercount:n
§ -ta=host
§ -ta=tesla,radeon,host

Target accelerator differences
§ Tesla
  - Planner for Tesla
  - PGI generated code + CUDA toolkit and runtime
  - PGI OpenACC runtime
§ Radeon
  - Planner for Radeon
  - PGI generated code + AMD OpenCL toolkit and runtime
  - uses AMD OpenCL extensions
  - PGI OpenACC runtime

OpenACC compiler/runtime architecture
[Diagram, with Tesla and Radeon paths side by side: target-independent analysis; Tesla planner / Radeon planner; device code optimizations; CUDA toolkit / OpenCL toolkit; embedded device code; compiler-generated runtime interface; OpenACC runtime; CUDA interface / OpenCL interface; CUDA Driver / OpenCL library]

Ample Parallelism
§ Nested parallel loops
§ Parallel outer loops, vector (stride-1) inner loops
§ Loop collapsing
§ Asynchronous operations when possible
§ Minimize data transfers

Test platforms
§ Intel Sandy Bridge Core i7-3930K 3.2GHz, 12MB L3$, 16 cores
§ NVIDIA Tesla K40c, 12GB memory, 15 SMX, 875MHz
§ AMD Radeon S10000, 3GB memory, 32 CUs, 925 MHz
§ PGI 14.3 compilers: -fast -ta=tesla:cc35 -ta=radeon:tahiti
§ SPECAccel OpenACC suite, ref dataset

Performance Results

              NVIDIA Kepler   AMD Radeon
              K40             HD 7970
  bt          75.7            72.8
  cg          144             133
  cloverleaf  145             182
  csp         143             78.6
  ep          330             281
  ilbdc       97.3            325
  md          90.7            97.2
  miniGhost   132             249
  olbm        194             112
  omriq       354             415
  ostencil    45.8            53.7
  palm        184             216
  seismic     133             95.7
  sp          118             96.4
  swim        88.4            121

Target-specific Tuning
§ #pragma acc ... device_type( nvidia ) ... device_type( radeon ) ...
  - vector_length, num_workers, num_gangs
§ if( acc_on_device( acc_device_nvidia ) )...
  if( acc_on_device( acc_device_radeon ) )...
§ #ifdef

Performance Portability with OpenACC
§ Single source, multiple accelerator targets
§ Single binary, multiple accelerator targets
§ Performance portability promise, delivered!
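
Example (not from the original slides): the three target-specific tuning mechanisms named in the deck (device_type clauses, acc_on_device, and #ifdef) might be combined as in the minimal sketch below. The function names simd_width and scale2d are hypothetical, and the vector_length values are illustrative guesses, not tuned settings from the talk. The loop nest follows the "parallel outer loop, vector (stride-1) inner loop" pattern. The acc_device_radeon name is taken from the slides; whether a given compiler's openacc.h defines it is an assumption. A compiler without OpenACC support ignores the pragmas and runs the loops serially.

```c
#include <assert.h>
#ifdef _OPENACC
#include <openacc.h>   /* present only when OpenACC is enabled */
#endif

/* Hypothetical helper: width of the hardware SIMD group on the current
   target (32-thread warp on NVIDIA, 64-lane wavefront on AMD GCN), or 1
   on the host or when built without OpenACC. */
#pragma acc routine seq
int simd_width(void)
{
#ifdef _OPENACC
    if (acc_on_device(acc_device_nvidia)) return 32;
    if (acc_on_device(acc_device_radeon)) return 64;  /* name assumed from the slides */
#endif
    return 1;
}

/* out = a * in over a rows x cols row-major array.
   Outer loop across gangs, stride-1 inner loop across vector lanes.
   Clauses after each device_type apply only to that target; the
   vector lengths here are illustrative, not measured optima. */
void scale2d(int rows, int cols, float a, const float *in, float *out)
{
    assert(rows >= 0 && cols >= 0);
    #pragma acc parallel loop gang \
        copyin(in[0:rows*cols]) copyout(out[0:rows*cols]) \
        device_type(nvidia) vector_length(128) \
        device_type(radeon) vector_length(64)
    for (int i = 0; i < rows; ++i) {
        #pragma acc loop vector
        for (int j = 0; j < cols; ++j)
            out[i * cols + j] = a * in[i * cols + j];
    }
}
```

With the PGI flags listed in the deck, a single source file like this would presumably be built for several targets at once with something like pgcc -fast -ta=tesla,radeon,host; the exact command line is an assumption, since the slides list the flags but not a full invocation.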
