Using CAPS Compiler on NVIDIA Kepler and CARMA Systems

Using CAPS Compiler on NVIDIA Kepler and CARMA Systems F. Bodin CTO – CAPS entreprise Introduction • CAPS develops programming tools to help writing a unique source code that can be executed on existing accelerator technologies o C / C++ / Fortran • Fast moving hardware systems require two directives sets o OpenHMPP - easy to extend – integrate new HW features o OpenACC - standardized – longer term view but moving slowly • Generates CUDA or OpenCL codes o Portable on AMD GPU and APU, Intel MIC, Nvidia Kepler-Carma, … nvidia SC 2012 www.caps-entreprise.com 2 CAPS Technology • Provide OpenACC and OpenHMPP directives o OpenHMPP codelet based o OpenACC code region based #pragma hmpp myfunc codelet, … void saxpy(int n, float alpha, float x[n], float y[n]){ #pragma hmppcg gridify(i) for(int i = 0; i<n; ++i) y[i] = alpha*x[i] + y[i]; } #pragma acc kernels … { for(int i = 0; i<n; ++i) y[i] = alpha*x[i] + y[i]; } nvidia SC 2012 www.caps-entreprise.com 3 Compilation Process • Source-to-source technology C++ C Fortran Frontend Frontend Frontend Extraction module codelets Host Fun#1 Fun #2 code Fun #3 Instrumentation OpenCL/Cuda module Generation CPU compiler Native (gcc, ifort, …) compilers Executable CAPS HWA Code (mybin.exe) Runtime (Dyn. library) www.caps-entreprise.com nvidia SC 2012 4 A Few Typical Situations 1. Simple nested loops 6. Dealing with accelerated library 2. Data transfer optimization 7. Dealing with dynamic accelerated tasks scheduling 3. Complex loop nests 8. Using multiple accelerators 4. Code tuning 9. Nested parallelism using native 5. Integrating auto-tuning techniques nvidia SC 2012 www.caps-entreprise.com 5 Simple nested loops - 1 Host • The simple construct is to CPU declare a parallel loop to be code Accelerator compiled and executed on an accelerator send A,B,C o Iterations of the loop nests are converted into threads execute kern. 1 • Data in and out declaration is used to determine the accelerator for wait data to move between the get A host and the accelerator nvidia SC 2012 www.caps-entreprise.com 6 Simple nested loops - 2 • Example of stencil computation ... #pragma acc kernels pcopyin(A[0:m]) pcopy(B[0:m]) { float c11,c12,c13,c21,c22,c23,c31,c32,c33; c11 = +2.0f; c21 = +5.0f; c31 = -8.0f; ... #pragma acc loop independent for (int i = 1; i < M - 1; ++i){ #pragma acc loop independent for (int j = 1; j < N - 1; ++j){ B[i][j] = c11*A[i-1][j-1]+c12*A[i][j-1]...; }}} ... nvidia SC 2012 www.caps-entreprise.com 7 Data Transfer Optimization - 1 Host CPU Accelerator code • Data transfers between the send A execute host CPU and the accelerator csite 1 kern. 1 may very negatively impact on get A performance send A execute csite 2 kern. 2 • A set of directives are provided to keep data on the accelerator csite 3 execute beyond the execution of one kern. 3 kernel csite 4 execute kern. 4 get A nvidia SC 2012 www.caps-entreprise.com 8 Data Transfer Optimization - 2 • Example from HydroC* main (100%) void hydro_godunov (…) hydro_godunov compute_deltat { (97.0%) (2.9%) #pragma acc data \ create(qleft[0:H.nvar], qright[0:H.nvar], \ riem slop trace \...\) \ ann e (10.1 (48.2 (11.2 %) %) %) gatherConservativeVa equationofst copy(uold[0:H.nvar*H.nxt*H.nyt]) \ updateConservativeVar rs ate s (5.4%) (5.0%) copyin(Hstep) (5.6%) qleftri constop cm ght rim pflx { (4.9%) (4.6%) (3.0 for (j = Hmin; j < Hmax; j += Hstep){ %) // compute many slices each pass int jend = j + Hstep; if (jend >= Hmax) Data are left on the GPU jend = Hmax; during the step loop. .// the work here pcopy clauses are used } // for j into called routines }//end of data region *Pierre-François Lavalléea, Guillaume Colin de Verdièreb, ... Philippe Wauteleta, Dimitri Lecasa, Jean-Michel Dupaysa nvidia SC 2012 www.caps-entreprise.com aIDRIS/CNRS, bCEA,Centre DAM 9 Complex Loop Nests - 1 • Non perfectly nested loops can be challenging to parallelize efficiently o OpenACC parallel regions provide control over the parallelization scheme o Requires to distribute the iteration spaces onto gangs and workers Device Gang Workers Vectors nvidia SC 2012 www.caps-entreprise.com 10 Complex Loop Nests - 2 • Extract from NOAA Nonhydrostatic Icosahedral Model (NIM) code !$acc parallel present(nprox,prox,u,...) vector_length(1) num_workers(64) num_gangs(512) !$acc loop gang private (rhsu,...) private(ipn,k,isn,...) do ipn=ips,ipe n = nprox(ipn) ipp1 = prox(1,ipn) (continued from previous page) ... !$acc loop seq !$acc loop worker vector do isn = 1,nprox(ipn) do k=1,nz-1 isp=mod(isn,nprox(ipn))+1 rhsu(k,1) = cs(1,ipn)*u(k ,ipp1)... !$acc loop worker vector ... do k = 2,nz-1 enddo !k-loop ... k=nz-1 end do ! k -loop rhsu(k+1,1) = cs(1,ipn)*u(k ,ipp1)... sedgvar( 1,isn,ipn,1)=(zm(1,ipn)... ... ... !$acc loop worker vector private(wk) end do ! isn-loop do k=1,nz !$acc loop worker vector Lots of statements do k=1,nz enddo !k-loop kp1=min(nz,k+1) !$acc loop seq ... do isn = 1,nprox(ipn) end do !$acc loop worker vector bedgvar(0,ipn,1)=... do k=1,nz-1 enddo !ipn-loop Tgtu(k,isn) = ... !$acc end parallel enddo !k-loop Tgtu(nz,isn) = 2.*Tgtu(nz-1,isn) - ... end do ! isn-loop nvidia(continued SC 2012 on next page) www.caps-entreprise.com 11 Accelerating Heat Transfer Ray-Tracing Code with CAPS OpenACC Compiler 120000 ) • Speedup X 19 s 100000 o With a FERMI C2050 80000 60000 • Speedup X 34 X 19 40000 o With K20 Execution time ( time Execution 20000 0 3840 15360 76800 200000 Performances for various numbers of rays and configurations. PROMES Laboratory Application: CAPS Experimentations done on FERMI C2050 / K20 in comparison ray-by-ray heat transfer simulation (DP) to Sandy Bridge E5-2687W nvidia SC 2012 www.caps-entreprise.com 12 OpenACC on Nvidia CARMA Accelerating Heat Transfer Ray-Tracing Code with CAPS OpenACC Compiler o ARM CPU + Accelerator Target o Speedup 12x ARM cores and CARMA GPU CUDA on ARM nvidia SC 2012 www.caps-entreprise.com 13 OpenHMPP MD Example • Molecular dynamic codes (HLRS / Colin Glass) • From the APOS project (http://apos-project.eu) source: HLRS nvidia SC 2012 www.caps-entreprise.com 14 OpenHMPP Stereo Vision Example • Stereo Matching (ESAW) on GTX 550 Ti • Result from the ANR Compa project From: Jinglin ZHANG, Jean-Francois NEZAN, Jean-Gabriel COUSIN, Erwan RAFFIN "Implementation of Stereo Matching Using A High Level Compiler for Parallel Computing Acceleration" IVCNZ ’12, November 26 - 28 2012, Dunedin, New Zealand nvidia SC 2012 www.caps-entreprise.com 15 Code Tuning - 1 • The more optimized a code is, the less portable it is o Optimized code tends to saturate some hardware resources o Parallelism ROI varies a lot • i.e. # threads and workload need to be tuned o Many resources not virtualized on HWA (e.g. registers, #threads) Threads 1 0.8 HW1 0.6 0.4 Occupancy Registers/threads 0.2 HW2 Run 1 norm 0 Run 2 norm performance cores Mem. Throughput L1 Hit Ratio Example of an optimized versus a non optimized stencil code nvidia SC 2012 www.caps-entreprise.com 16 Code Tuning - 2 • Express code transformations via directives #pragma hmpp <mygroup> sgemm codelet, args[tout].io=inout, & #pragma hmpp & args[*].mirror args[*].transfer=manual void sgemm( float alphav[1], float betav[1], const float t1[SIZE][SIZE],Loop transformations const float t2[SIZE][SIZE], float tout[SIZE][SIZE] ) { int j, i; const float alpha = alphav[0], beta = betav[0]; #pragma hmppcg(OCL) unroll i:4, j:4, split(i), noremainder(i,j), jam #pragma hmppcg gridify (j,i) for( j = 0 ; j < SIZE ; j++ ) { for( i = 0 ; i < SIZE ; i++ ) { int k; float prod = 0.0f; for( k = 0 ; k < SIZE ; k++ ) { Apply only when prod += t1[k][i] * t2[j][k]; compiling for OpenCL } tout[j][i] = alpha * prod + beta * tout[j][i]; } } } nvidia SC 2012 www.caps-entreprise.com 17 Integrating Auto-Tuning Techniques - 1 • Adaptation of the code @ runtime o Use multiple call to a kernel to find the most efficient one • Need to create an optimization space to explore o Compiler issue runtime configurable #gang, #worker, #vector • Need a way to explore optimization space o Auto-tuning driver issue o may also focus on execution time or energy Source code HMPP Compiler Autotunable hmpp auto-tuning collect profiling data explore the variants executable profiling driver space code interface nvidia SC 2012 www.caps-entreprise.com 18 Integrating Auto-Tuning Techniques – 2-a • Auto-tuning implementation of a Blur filter in OpenACC • Explore dynamic parameters (e.g. #gangs, #workers) #pragma acc parallel, copyin(dst_caps[0:height*width]), copyout(src_caps[0:height*width]), num_gangs(gangs), Parameterized num_workers(workers), vector_length(32) { parallel regions #pragma acc loop, gang for (tileY = 0; tileY < tileCountY; tileY++) { parameter space for (tileX = 0; tileX < tileCountX; tileX++) { to explore … set auto-tuning driver on size_t gangs[] = { 8, 16, 32, 64, 128, 128, 8, 16, 32, 64, 128, 256 }; size_t workers[] = { 16, 16, 16, 16, 16, 16, 24, 24, 24, 24, 24, 24 }; … while (nber_of_iterations < max_iterations) { … variant = variantSelectorState("kernel.c:21", (sizeof(gangs)/sizeof(size_t))-1); blur(images[(currentImage + 1) % 2], image_caps, width, height, blockSize, gangs[variant], workers[variant]); … } nvidia SC 2012 www.caps-entreprise.com 19 Integrating Auto-Tuning Techniques – 2-b • Auto-tuning driver behavior with DNADist (bio info) Kernel Computation Time (in sec). Lower is better 10 config. 8 = 14 G x 16 W 8 config 10 = 256 G x 128 W 6 4 Kernel Computation Time (in sec).

Using CAPS Compiler on NVIDIA Kepler and CARMA Systems

Pathscale ENZO GTC12 S0631 – Programming Heterogeneous Many-Cores Using Directives C

Locality-Aware Automatic Parallelization for GPGPU with Openhmpp Directives

How to Write Code That Will Survive

CAPS Openacc Compiler

This Is the GPU Based ANUGA

Downloads Parallel Code and GPU So That the Program Can Execute Output Data from the HWA to the Host

Program Sequentially, Carefully, and Benefit from Compiler Advances For

An Introduction to Parallel Programming

Analysis of Task Scheduling for Multi-Core Embedded Systems

Warwick.Ac.Uk/Lib-Publications MENS T a T a G I MOLEM

Opencl 2.0, Opencl SYCL & Openmp 4

Hybrid Programming Using Openshmem and Openacc