OpenCL – OpenACC & Exascale

F. Bodin

Introduction

• Exascale architectures may be
  o Heterogeneous compute units
  o Hierarchical memory systems
  o Unreliable
  o Asynchronous
  o Very energy-saving oriented
  o …

• The Exascale roadmap needs to be built on programming standards
  o Nobody can afford re-writing applications again and again
  o The Exascale roadmap, HPC, mass-market many-core and embedded systems share many common issues
  o Exascale is not about a heroic technology development
  o Exascale projects must provide technology for a large industry base/uses

• OpenACC and OpenCL may be candidates
  o Dealing with the inside of the node
  o Part of a standardization initiative
  o OpenACC complementary to OpenCL

• This presentation tries to forecast OpenACC and OpenCL in the light of Exascale challenges
  o Challenges as identified by the ETP4HPC (http://www.etp4hpc.eu)


Programming Environments Context

1. Standardization initiative
   o Software developments need visibility

2. Intellectual property issues
   o Impact on tools development and interactions
   o Fundamental for creating an ecosystem
   o Everything open source is not the answer

3. Software engineering, applications and users expectations
   o Minimizing maintenance effort, one code for all targets

4. Tools development strategy
   o How to create a coherent ecosystem?

Outline of the presentation

• A very short overview of OpenCL

• A very short overview of OpenACC

• OpenACC-OpenCL versus Exascale challenges

OpenCL Overview

• Open Computing Language
  o C-based cross-platform programming interface
  o Subset of ISO C99 with language extensions
  o Data- and task-parallel compute model

• Host-Compute Devices (GPUs) model

• Platform layer API and runtime API
  o Hardware abstraction layer, …
  o Manage resources

• Portable syntax
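As an illustration of how the platform and runtime layers fit together, here is a minimal sketch of the setup used by the examples in the following slides (error checking omitted; picking the first platform and a single GPU device is an assumption):

cl_platform_id platform;
cl_device_id devices[1];
cl_int err;

// platform layer: discover a platform and a GPU device
err = clGetPlatformIDs(1, &platform, NULL);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, devices, NULL);

// runtime layer: create the context that owns buffers, programs and queues
cl_context my_context = clCreateContext(NULL, 1, devices, NULL, NULL, &err);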

Memory Model

• Four distinct memory regions
  o Global Memory
  o Local Memory
  o Constant Memory
  o Private Memory

• Global and Constant memories are common to all work-items (WI)
  o May be cached depending on the hardware capabilities

• Local memory is shared by all WI of a work-group (WG)

• Private memory is private to each WI
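To make the sharing rules concrete, here is a sketch of a kernel that uses a __local scratch buffer shared by the WI of one WG (the kernel name and the reduction pattern are purely illustrative); the __local argument is set from the host with clSetKernelArg(kernel, 2, local_bytes, NULL):

__kernel void partial_sum(__global const float* in,
                          __global float* out,
                          __local float* scratch)
{
    int lid = get_local_id(0);

    // every work-item of the work-group writes into the shared local buffer
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);   // all WI of the WG now see the same local data

    // one work-item combines the values held in local memory
    if (lid == 0) {
        float s = 0.0f;              // s lives in private memory
        for (int i = 0; i < (int)get_local_size(0); ++i)
            s += scratch[i];
        out[get_group_id(0)] = s;
    }
}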

OpenCL Memory Hierarchy

(Memory hierarchy diagram from Aaftab Munshi's talk at SIGGRAPH 2008)

Data-Parallelism in OpenCL

• A kernel is executed by the work-items
  o Same parallel model as CUDA (< 4.x)

// OpenCL kernel function for element-by-element vector addition
__kernel void VectorAdd(__global const float8* a,
                        __global const float8* b,
                        __global float8* c)
{
    // get oct-float index into global data array
    int iGID = get_global_id(0);

    // read inputs into registers
    float8 f8InA = a[iGID];
    float8 f8InB = b[iGID];
    float8 f8Out = (float8)0.0f;

    // add the vector elements
    f8Out.s0 = f8InA.s0 + f8InB.s0;
    f8Out.s1 = f8InA.s1 + f8InB.s1;
    f8Out.s2 = f8InA.s2 + f8InB.s2;
    f8Out.s3 = f8InA.s3 + f8InB.s3;
    f8Out.s4 = f8InA.s4 + f8InB.s4;
    f8Out.s5 = f8InA.s5 + f8InB.s5;
    f8Out.s6 = f8InA.s6 + f8InB.s6;
    f8Out.s7 = f8InA.s7 + f8InB.s7;

    // write back out to GMEM
    c[get_global_id(0)] = f8Out;
}

OpenCL vs CUDA

• OpenCL and CUDA share the same parallel programming model

  OpenCL        CUDA
  kernel        kernel
  host pgm      host pgm
  NDRange       grid
  work item     thread
  work group    block
  global mem    global mem
  cst mem       cst mem
  local mem     shared mem
  private mem   local mem

• Runtime API is different
  o OpenCL is lower level than CUDA

• OpenCL and CUDA may use different implementations that could lead to different execution times for a similar kernel on the same hardware

Basic OpenCL Runtime Operations

• Create a command-queue

• Then queue up OpenCL commands
  o Data transfers
  o Kernel launches

• Allocate the accelerator's memory
  o Before transferring data

• Free the memory

• Manage errors

The Command Queue (1)

• Command-queue can be used to queue a set of operations

• Having multiple command-queues allows applications to queue multiple independent commands without requiring synchronization

• Create an OpenCL command-queue

cl_command_queue clCreateCommandQueue(cl_context context,
                                      cl_device_id device,
                                      cl_command_queue_properties properties,
                                      cl_int *errcode_ret)

The Command Queue (2)

• Example
  o Allocation of a queue for device 0

cl_command_queue my_cmd_queue;
my_cmd_queue = clCreateCommandQueue(my_context, devices[0], 0, NULL);

• Flush a command queue
  o All queued commands have been issued to the device

cl_int clFlush(cl_command_queue command_queue)

• Finish a command queue (synchronization) o All commands have terminated

cl_int clFinish(cl_command_queue command_queue)

The Command Queue (3)

• Possible to have multiple command queues on a device
  o Command queues are asynchronous
  o The programmer must synchronize when needed
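A sketch of this point, reusing the context and device of the earlier examples; buf_a/buf_b, host_a/host_b and memsize are hypothetical buffers, host arrays and size:

cl_command_queue q0 = clCreateCommandQueue(my_context, devices[0], 0, &err);
cl_command_queue q1 = clCreateCommandQueue(my_context, devices[0], 0, &err);

// the two transfers are independent and may overlap
clEnqueueWriteBuffer(q0, buf_a, CL_FALSE, 0, memsize, host_a, 0, NULL, NULL);
clEnqueueWriteBuffer(q1, buf_b, CL_FALSE, 0, memsize, host_b, 0, NULL, NULL);

// explicit synchronization, one queue at a time
clFinish(q0);
clFinish(q1);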

How to Allocate Memory on a Device?

• Memory objects are categorized into two types
  o Buffer objects: 1D memory
  o Image objects: 2D-3D memory

• Buffer contents can be
  o A scalar data type
  o A vector data type
  o A user-defined structure

• Memory objects are described by a cl_mem object

• Kernels take cl_mem objects as input or output

Allocate 1D Memory

• Create a buffer

cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size_in_bytes,
                      void *host_ptr,
                      cl_int *errcode_ret)

• Example
  o Allocate a single precision float matrix as input

size_t memsize = nb_elements * sizeof(float);
cl_mem mat_a_gpu = clCreateBuffer(my_context, CL_MEM_READ_ONLY, memsize, NULL, &err);

o And as output

cl_mem mat_res_gpu = clCreateBuffer(my_context, CL_MEM_WRITE_ONLY, memsize, NULL, &err);

Transfer Data to Device (1)

• Any data transfer to/from the device implies
  o A host pointer
  o A device memory object
  o The size of data (in bytes)
  o The command queue
  o Whether the transfer is blocking or not
  o …

• In case of a non-blocking transfer
  o Link a cl_event to the transfer
  o Check that the transfer has finished with:

cl_int clWaitForEvents (cl_uint num_events, const cl_event *event_list)
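A sketch of a non-blocking write combined with clWaitForEvents, reusing the mat_a / mat_a_gpu / memsize objects of the surrounding examples:

cl_event write_evt;

// CL_FALSE: the call returns immediately, the transfer runs in the background
err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_FALSE, 0, memsize,
                           (void*) mat_a, 0, NULL, &write_evt);

// ... do independent host work here ...

// block until the transfer has completed
err = clWaitForEvents(1, &write_evt);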

Transfer Data to Device (2)

• Write in a buffer

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t size_in_bytes,
                            const void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event)

• Example
  o Transferring mat_a synchronously

err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_TRUE, 0, memsize,
                           (void*) mat_a, 0, NULL, &evt);

Kernel Arguments

• A kernel needs arguments
  o So we must set these arguments
  o Arguments can be scalar, vector or user-defined data types

• Set the kernel arguments

cl_int clSetKernelArg( cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value)

• Example
  o Set the a argument
  o Set the res argument
  o Set the size argument

err = clSetKernelArg(my_kernel, 0, sizeof(cl_mem), (void*) &mat_a_gpu);
err = clSetKernelArg(my_kernel, 1, sizeof(cl_mem), (void*) &mat_res_gpu);
err = clSetKernelArg(my_kernel, 2, sizeof(int), (void*) &nb_elements);

Settings for Kernel Launching

• Set the NDRange (grid) geometry

size_t global_work_size[2] = {nb_elements_x, nb_elements_y};
size_t local_work_size[2]  = {16, 16};

• Task parallel model is used for CPU o General task : complex, independent, …

• Data-parallel is used for GPU o Need to set a grid, NDRange in OpenCL
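Since the global size must be compatible with the work-group size, a common pattern (a sketch reusing the sizes declared above) is to round each dimension up to a multiple of the local size:

// round each dimension up to a multiple of the 16x16 work-group size
global_work_size[0] = ((nb_elements_x + 15) / 16) * 16;
global_work_size[1] = ((nb_elements_y + 15) / 16) * 16;
// the kernel then tests get_global_id() against the real problem size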

Kernel Launch (1)

• If task kernel o Use the queued task command

cl_int clEnqueueTask(cl_command_queue command_queue,
                     cl_kernel kernel,
                     cl_uint num_events_in_wait_list,
                     const cl_event *event_wait_list,
                     cl_event *event)

• If data-parallel kernel o Use the queued NDRange kernel command

cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                              cl_kernel kernel,
                              cl_uint work_dim,
                              const size_t *global_work_offset,
                              const size_t *global_work_size,
                              const size_t *local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event)

Kernel Launch (2)

• The launch of the kernel is asynchronous by default

err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL,
                             &global_work_size[0], &local_work_size[0],
                             0, NULL, &evt);
clFinish(cmd_queue);
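The same event can also provide timing information, provided the command queue was created with the CL_QUEUE_PROFILING_ENABLE property (not set in the earlier queue-creation example); a hedged sketch:

cl_ulong t_start, t_end;

clWaitForEvents(1, &evt);   // make sure the kernel has completed
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);

// elapsed device time in nanoseconds
cl_ulong elapsed_ns = t_end - t_start;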

Copy Back the Results

• About the same as the copy in o From device to host

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t size_in_bytes,
                           void *ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event)

• Example

err = clEnqueueReadBuffer(cmd_queue, mat_res_gpu, CL_TRUE, 0, memsize,
                          (void*) mat_res, 0, NULL, NULL);
clFinish(cmd_queue);

Free Device's Memory

• At the end you need to release the allocated memory

• Releasing the memory

cl_int clReleaseMemObject(cl_mem memobj)

• Example o Release matrix a on GPU o Release matrix res on GPU

clReleaseMemObject(mat_a_gpu);
clReleaseMemObject(mat_res_gpu);

Release Objects

• At the end, you must release
  o The programs
  o The kernels
  o The command queues
  o And the context

cl_int clReleaseKernel (cl_kernel kernel)

cl_int clReleaseProgram (cl_program program)

cl_int clReleaseCommandQueue (cl_command_queue command_queue)

cl_int clReleaseContext (cl_context context)

Error Management

• Do not forget to manage errors o Example with the copy back

// read output image
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n * sizeof(cl_float), dst, 0, NULL, NULL);

if (err != CL_SUCCESS) {
    delete_memobjs(memobjs, 3);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
    return -1;
}
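Since nearly every call returns a cl_int status, a small helper such as the hypothetical CL_CHECK macro below (requires <stdio.h> and <stdlib.h>) keeps the checks readable; a sketch:

#define CL_CHECK(call)                                          \
    do {                                                        \
        cl_int _status = (call);                                \
        if (_status != CL_SUCCESS) {                            \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",       \
                    _status, __FILE__, __LINE__);               \
            exit(EXIT_FAILURE);                                 \
        }                                                       \
    } while (0)

// usage
CL_CHECK(clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                             n * sizeof(cl_float), dst, 0, NULL, NULL));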

OpenCL Interesting Characteristics

• Parallel programming model
  o Simple and massively parallel (for node programming)
  o Helps to exploit memory bandwidth via vector accesses
  o Performance is kind of "predictable" (i.e. similar to MPI, which has explicit communication)

• Fine grain control over resources
  o Detailed API, JIT compilation

• Available on most platforms
  o NVIDIA/AMD/ARM GPUs
  o Intel/AMD/… CPUs
  o Intel MIC

OpenACC Concepts

• Express data and computations to be executed on an accelerator o Using marked code regions

• Main OpenACC constructs
  o Kernel regions
  o Loops
  o Data regions
  (Data/stream/vector parallelism to be exploited by the HWA, e.g. via CUDA / OpenCL)

• Runtime API

(Diagram: CPU and HWA linked with a PCIx bus)
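For the runtime API part, a minimal sketch of device query and initialization (functions from the OpenACC runtime header openacc.h; targeting an NVIDIA device is an assumption):

#include <openacc.h>

int ndev = acc_get_num_devices(acc_device_nvidia);   // how many accelerators?
acc_set_device_num(0, acc_device_nvidia);            // select device 0
acc_init(acc_device_nvidia);                         // initialize the runtime/device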

OpenACC Kernels Regions

• Identifies sections of code to be executed on the accelerator

• Parallel loops inside a kernel region are turned into accelerator kernels
  o Such as CUDA or OpenCL kernels
  o Different loop nests may have different gridifications

#pragma acc kernels
{
  for (int i = 0; i < n; ++i){
    for (int j = 0; j < n; ++j){
      for (int k = 0; k < n; ++k){
        B[i][j*k%n] = A[i][j*k%n];
      }
    }
  }
  ...
  for (int i = 0; i < n; ++i){
    for (int j = 0; j < m; ++j){
      B[i][j] = i * j * A[i][j];
    }
  }
  ...
}

OpenACC Kernel Execution Model

• Based on three parallelism levels
  o Gangs – coarse grain
  o Workers – fine grain
  o Vectors – finest grain

• Mapping on the physical architecture is compiler dependent

(Diagram: a device executes gangs; each gang contains workers; each worker operates on vector lanes.)
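A hedged sketch of how the three levels can be requested explicitly on a loop nest (the gang/worker/vector clauses are standard OpenACC; how they map to the hardware remains compiler dependent):

#pragma acc kernels
{
  #pragma acc loop gang          // coarse grain: iterations spread over gangs
  for (int i = 0; i < n; ++i) {
    #pragma acc loop worker      // fine grain: workers inside a gang
    for (int j = 0; j < m; ++j) {
      #pragma acc loop vector    // finest grain: vector/SIMD lanes
      for (int k = 0; k < p; ++k) {
        B[i][j][k] = 2.0f * A[i][j][k];
      }
    }
  }
}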

OpenACC Loop Independent Directive

• Inserted inside Kernels regions

• Describes that a loop is data-independent

• Other clauses can declare loop-private variables or arrays, and reduction operations

#pragma acc loop independent
for (int i = 0; i < n; ++i){      // iterations of variable i are data independent
  B[i][0] = 1;
  for (int j = 1; j < m; ++j){    // iterations of variable j are not data independent
    B[i][j] = i * j * A[i][j-1];
  }
}

OpenACC Data Regions

• Data regions define scalars, arrays and sub-arrays
  o To be allocated in the device memory for the duration of the region
  o To be explicitly managed using transfer clauses or directives

• Optimizing transfers consists in:
  o Transferring data
    • From CPU to GPU when entering a data region
    • From GPU to CPU when exiting a data region
  o Launching several kernels
    • That can reuse the data inside this data region

• Kernels regions implicitly define data regions

Data Management Directives Example

#pragma acc data copyin(A[1:N-2]), copyout(B[N])
{
  #pragma acc kernels
  {
    #pragma acc loop independent
    for (int i = 0; i < N; ++i){
      A[i][0] = ...;
      A[i][M - 1] = 0.0f;
    }
    ...
  }

  #pragma acc update host(A)
  ...

  #pragma acc kernels
  for (int i = 0; i < n; ++i){
    B[i] = ...;
  }
}

CAPS OpenACC Compiler Flow

(Compiler flow diagram: C, C++ and Fortran frontends feed an extraction module that separates the host code from the codelets (Fun#1, Fun#2, Fun#3); an instrumentation module handles the host code while CUDA and OpenCL code generation handle the codelets; the CPU compiler (gcc, ifort, …) and the CUDA/OpenCL compilers then produce the executable (mybin.exe), linked with the CAPS runtime, and the HWA code as a dynamic library.)

OpenACC Interesting Characteristics

• Talks about data structures and multiple address spaces
  o Does not assume a coherent unique shared address space

• Simple parallel model
  o Parallel nested loops

• Easy to extend
  o Directives or clauses can be goal specific, does not modify the base language

• Available on many platforms
  o Via the use of OpenCL as a compiler target

• Many open source projects
  o Many of them based on LLVM

• Should be very close to the OpenMP Accelerator Extension
  o Worst case, easy to move OpenACC code to the OpenMP accelerator extension (see the sketch below)
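As a rough illustration of the last point, the sketch below places an OpenACC kernels loop next to one possible OpenMP accelerator-extension counterpart; the mapping is an assumption about how a port could look, not a defined equivalence:

/* OpenACC */
#pragma acc data copyin(a[0:n]) copyout(b[0:n])
{
  #pragma acc kernels loop
  for (int i = 0; i < n; ++i)
    b[i] = 2.0f * a[i];
}

/* OpenMP accelerator extension (approximate counterpart) */
#pragma omp target map(to: a[0:n]) map(from: b[0:n])
#pragma omp teams distribute parallel for
for (int i = 0; i < n; ++i)
  b[i] = 2.0f * a[i];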

Exascale Programming Systems Technological Challenges

1. Parallel programming

2. Runtime support/systems

3. Debugging and correctness

4. High performance libraries and components

5. Performance tools

6. Tools infrastructure

7. Cross cutting issues

(Topics extracted from the ETP4HPC SRA programming environment chapter, http://www.etp4hpc.eu. Several of these challenges are discussed in the remainder of this presentation.)

Parallel Programming APIs Research Topics

• Domain specific languages
• API for legacy codes
• MPI + X approaches
• Partitioned Global Address Space (PGAS) languages and APIs
• Dealing with hardware heterogeneity
• Data oriented approaches and languages
• Auto-tuning API
• Asynchronous programming models and languages

API for Legacy Codes

• OpenACC
  o Directive-based approach, particularly suited to legacy codes
  o Focused on the heterogeneous node
  o Not C only: also targets Fortran and C++

• OpenCL
  o Not that convenient for legacy codes
  o Complex to mix with OpenMP
  o Can be used to unify multithreading

MPI + X Approaches

• OpenACC
  o Complementary to MPI
  o Complex to mix with OpenMP, i.e. balancing the load over the CPUs and accelerators
  o OpenACC can deal with thread and accelerator parallelism, but its expression of parallelism does not fit all applications

• OpenCL o idem
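A common way to combine MPI with OpenACC is to bind one accelerator per rank using the runtime API; a minimal sketch (the round-robin rank-to-device mapping assumes ranks are placed contiguously on each node):

#include <mpi.h>
#include <openacc.h>

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// each rank picks one of the node's accelerators
int ndev = acc_get_num_devices(acc_device_nvidia);
if (ndev > 0)
    acc_set_device_num(rank % ndev, acc_device_nvidia);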

Dealing with Hardware Heterogeneity

• OpenACC
  o Designed for this
  o May simplify code tuning
  o No automatic load balancing over the heterogeneous units; needs to be extended
  o Better understanding of the code by the compiler (e.g. exposed data management, parallel nested loops)
    • Provides restructuring capabilities
  o May be extended to consider non-volatile memories (NVM)
  o Does not consider multiple accelerators
    • Extension to come

• OpenCL
  o Designed for this
  o Code tuning exposes many low-level details
  o Detailed API for resource management
    • Gives a lot of control to users
    • Programming may be complex
  o Interesting parallel model to help vectorization

Auto-tuning API

• Targeting performance portability issues

• What would an auto-tuning API provide?
  o Decision point description
    • e.g. callsite
  o Variants description
    • Abstract syntax trees
    • Execution constraints (e.g. specialized codelets)
  o Execution context
    • Parameter values
    • Hardware target description and allocation
  o Runtime control to select variants or drive runtime code generation

• OpenACC
  o OpenACC gives more opportunity to compilers/automatic tools
  o Can be extended to provide a standard API
  o Many tuning techniques over parallel loops
  o Orthogonal to programming

• OpenCL
  o Can integrate auto-tuning but may be limited in scope
  o OpenCL is low level; guessing high-level properties is difficult

Asynchronous Programming Models and Languages

• OpenACC
  o Limited asynchronous capabilities, constrained by the host-accelerator model
  o Not suited for data-flow approaches; needs to be extended (the OpenHMPP codelet concept is more suitable for this)

• OpenCL o idem

Debugging and Correctness Research Topics

1. Debugging heterogeneous/hybrid code
2. Static debugging
3. Dynamic debugging
4. Debugging highly asynchronous parallel code at full (Peta-, Exa-)scale
5. Runtime and debugger integration
6. Model-aware debugging
7. Automatic techniques

Debugging Heterogeneous/Hybrid Code

• OpenACC
  o Preserves most of the serial semantics, which helps a lot to design debugging tools
  o Helps to distinguish serial bugs from parallel ones
  o Programming can be very incremental, simplifying debugging

• OpenCL
  o Complex debugging due to many low-level details and the parallel/memory model

High Performance Libraries and Components Research Topics

1. Application componentization
2. Templates/skeleton/component-based approaches and languages
3. Components / library interoperability
4. Self- / auto-tuning libraries and components
5. New parallel algorithms / parallelization paradigms; e.g. resilient algorithms

Templates/Skeleton/Component-Based Approaches and Languages

• OpenACC
  o Can be used to write libraries; can exploit already allocated data/HW (pcopy clause)
  o If extended with tuning directives such as hmppcg (e.g. loop transformations), can be used to express templates:
    • Templates to express static code transformations
    • Use runtime techniques to tune dynamic parameters such as the number of gangs, workers and vector sizes

• OpenCL
  o Used a lot to write libraries
  o Fits well with C++ component development

Components / Library Interoperability

• Library calls can usually only be partially replaced
  o We want a unique source code even when using accelerated libraries; the CPU version is the reference point
  o No one-to-one mapping between libraries (e.g. BLAS vs. CuBLAS, FFTW vs. CuFFT)
  o No access to all application code (i.e. need to keep the CPU library)

• Deal with multiple address spaces / multi-HWA
  o Data location may not be unique (copies, mirrors)
  o Usual library calls assume a unique data location
  o Library efficiency depends on updated data location (long-term effect)

• OpenACC
  o Needs to interact with user code; currently limited to sharing the device data pointer
  o Missing automatic data management/allocation (e.g. StarPU) to deal with computation migrations (needed to adapt to hardware resources and compute load)

• OpenCL
  o OpenACC and OpenCL have to interact efficiently
  o The API can easily be normalized thanks to the standardization initiative
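For the device-pointer sharing mentioned in the OpenACC item above, the host_data construct passes the device address of data already managed by a data region to an external accelerated library; a sketch, with cublasSaxpy used only as a stand-in for any such library entry point:

// x and y are present on the device for the duration of the data region
#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
  #pragma acc host_data use_device(x, y)
  {
    // the library call receives device pointers, not host pointers
    cublasSaxpy(n, 2.0f, x, 1, y, 1);
  }
}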

Library Interoperability in HMPP 3.0

(Diagram: application code calls libRoutine1/2/3; with the #pragma hmppalt directive, the HMPP runtime routes a call either to a proxy targeting a native GPU library (C/CUDA/OpenCL) or to the original CPU library.)

Self- / Auto-tuning Libraries and Components

• OpenACC
  o Already provides dynamic parameters for code tuning (e.g. number of workers)
  o Needs to be extended to allow code template/skeleton descriptions

• OpenCL
  o Maybe not the right level, a bit too low level
  o Except for vectorization techniques

(Diagram: the HMPP compiler generates several codelet variants; a variant is selected dynamically, driven by execution feedback.)

[Cross cutting issues]

1. Standardization initiative
2. Fault tolerance at programming level
3. Programming energy consumption control
4. Tools interfaces and public APIs
5. Intellectual property issues
6. Performance portability issues
7. Software engineering, applications and users expectations
8. Tools development strategy
9. Validation: benchmarks and other mini-apps
10. Co-design (hardware - software; applications - programming environment)

Fault Tolerance at Programming Level

• OpenACC
  o OpenACC data regions can be extended to mark structures for specific fault-tolerance management
  o Extension of the memory model for NVM, etc.

• OpenCL
  o Data management via the API makes it difficult for static tools (e.g. compilers, analyzers)

Validation: Benchmarks and Other Mini-apps

• OpenACC
  o Extremely important to have good measurements of exascale potential
  o Kernels are not enough
  o Tools are usually designed to match benchmark requirements
    • Very influential on the output
  o Mini-apps (e.g. Hydro/PRACE, Mantevo) are a pragmatic and efficient approach
    • But extremely expensive to design
    • Must be production quality
    • Need to exhibit extremely scalable algorithms
  o On the critical path for the foundation of an ecosystem

• OpenCL
  o Idem
  o Limited to C

Conclusion

• OpenACC provides an interesting framework for designing an Exascale, non-revolutionary, programming environment
  o Leverages existing academic and industrial initiatives
  o May be used as a basic infrastructure for higher-level approaches
  o Mixable with MPI, PGAS, …
  o Available on many hardware targets

• OpenCL is very complementary as a basic device programming layer

• Need to mix high-level and low-level approaches in a consistent manner
  o Inside nodes, OpenACC/OpenMP-AE combined with OpenCL is at least worth a try as a basis for studying an exascale programming environment
  o Complementary with more revolutionary approaches


http://www.caps-entreprise.com
http://twitter.com/CAPSentreprise
http://www.openacc-standard.org/
http://www.openhmpp.org