OpenCL – OpenACC & Exascale

F. Bodin

Introduction

• Exascale architectures may be
  o Heterogeneous compute units
  o Hierarchical memory systems
  o Unreliable
  o Asynchronous
  o Very energy-saving oriented
  o …

• The Exascale roadmap needs to be built on programming standards
  o Nobody can afford re-writing applications again and again
  o The Exascale roadmap, HPC, mass-market many-core and embedded systems share many common issues
  o Exascale is not about a heroic technology development
  o Exascale projects must provide technology for a large industry base/uses

• OpenACC and OpenCL may be candidates
  o Dealing with the inside of the node
  o Part of a standardization initiative
  o OpenACC complementary to OpenCL

• This presentation tries to forecast OpenACC and OpenCL in the light of Exascale challenges
  o Challenges as identified by the ETP4HPC (http://www.etp4hpc.eu)


Programming Environments Context

1. Standardization initiative
   o Software developments need visibility

2. Intellectual property issues
   o Impact on tools development and interactions
   o Fundamental for creating an ecosystem
   o Everything open source is not the answer

3. Software engineering, applications and users expectations
   o Minimizing maintenance effort, one code for all targets

4. Tools development strategy
   o How to create a coherent ecosystem?

Outline of the presentation

• A very short overview of OpenCL

• A very short overview of OpenACC

• OpenACC-OpenCL versus Exascale challenges

OpenCL Overview

• Open Computing Language
  o C-based cross-platform programming interface
  o Subset of ISO C99 with language extensions
  o Data- and task-parallel compute model

• Host-Compute Devices (GPUs) model

• Platform layer API and runtime API
  o Hardware abstraction layer, …
  o Manage resources

• Portable syntax
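As an illustration of how the platform and runtime layers fit together, here is a minimal sketch of the setup used by the examples in the following slides (error checking omitted; picking the first platform and a single GPU device is an assumption):

cl_platform_id platform;
cl_device_id devices[1];
cl_int err;

// platform layer: discover a platform and a GPU device
err = clGetPlatformIDs(1, &platform, NULL);
err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, devices, NULL);

// runtime layer: create the context that owns buffers, programs and queues
cl_context my_context = clCreateContext(NULL, 1, devices, NULL, NULL, &err);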

Memory Model

• Four distinct memory regions
  o Global Memory
  o Local Memory
  o Constant Memory
  o Private Memory

• Global and Constant memories are common to all work-items (WI)
  o May be cached depending on the hardware capabilities

• Local memory is shared by all WI of a work-group (WG)

• Private memory is private to each WI
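To make the sharing rules concrete, here is a sketch of a kernel that uses a __local scratch buffer shared by the WI of one WG (the kernel name and the reduction pattern are purely illustrative); the __local argument is set from the host with clSetKernelArg(kernel, 2, local_bytes, NULL):

__kernel void partial_sum(__global const float* in,
                          __global float* out,
                          __local float* scratch)
{
    int lid = get_local_id(0);

    // every work-item of the work-group writes into the shared local buffer
    scratch[lid] = in[get_global_id(0)];
    barrier(CLK_LOCAL_MEM_FENCE);   // all WI of the WG now see the same local data

    // one work-item combines the values held in local memory
    if (lid == 0) {
        float s = 0.0f;              // s lives in private memory
        for (int i = 0; i < (int)get_local_size(0); ++i)
            s += scratch[i];
        out[get_group_id(0)] = s;
    }
}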

OpenCL Memory Hierarchy

(Memory hierarchy diagram from Aaftab Munshi's talk at SIGGRAPH 2008)

Data-Parallelism in OpenCL

• A kernel is executed by the work-items
  o Same parallel model as CUDA (< 4.x)

// OpenCL kernel function for element-by-element vector addition
__kernel void VectorAdd(__global const float8* a,
                        __global const float8* b,
                        __global float8* c)
{
    // get oct-float index into global data array
    int iGID = get_global_id(0);

    // read inputs into registers
    float8 f8InA = a[iGID];
    float8 f8InB = b[iGID];
    float8 f8Out = (float8)0.0f;

    // add the vector elements
    f8Out.s0 = f8InA.s0 + f8InB.s0;
    f8Out.s1 = f8InA.s1 + f8InB.s1;
    f8Out.s2 = f8InA.s2 + f8InB.s2;
    f8Out.s3 = f8InA.s3 + f8InB.s3;
    f8Out.s4 = f8InA.s4 + f8InB.s4;
    f8Out.s5 = f8InA.s5 + f8InB.s5;
    f8Out.s6 = f8InA.s6 + f8InB.s6;
    f8Out.s7 = f8InA.s7 + f8InB.s7;

    // write back out to GMEM
    c[get_global_id(0)] = f8Out;
}

OpenCL vs CUDA

• OpenCL and CUDA share the same parallel programming model

  OpenCL        CUDA
  kernel        kernel
  host pgm      host pgm
  NDRange       grid
  work item     thread
  work group    block
  global mem    global mem
  cst mem       cst mem
  local mem     shared mem
  private mem   local mem

• Runtime API is different
  o OpenCL is lower level than CUDA

• OpenCL and CUDA may use different implementations that could lead to different execution times for a similar kernel on the same hardware

Basic OpenCL Runtime Operations

• Create a command-queue

• Then queue up OpenCL commands
  o Data transfers
  o Kernel launches

• Allocate the accelerator's memory
  o Before transferring data

• Free the memory

• Manage errors

The Command Queue (1)

• Command-queue can be used to queue a set of operations

• Having multiple command-queues allows applications to queue multiple independent commands without requiring synchronization

• Create an OpenCL command-queue

cl_command_queue clCreateCommandQueue(cl_context context,
                                      cl_device_id device,
                                      cl_command_queue_properties properties,
                                      cl_int *errcode_ret)

The Command Queue (2)

• Example
  o Allocation of a queue for device 0

cl_command_queue my_cmd_queue;
my_cmd_queue = clCreateCommandQueue(my_context, devices[0], 0, NULL);

• Flush a command queue
  o All queued commands have been issued to the device

cl_int clFlush(cl_command_queue command_queue)

• Finish a command queue (synchronization) o All commands have terminated

cl_int clFinish(cl_command_queue command_queue)

The Command Queue (3)

• Possible to have multiple command queues on a device
  o Command queues are asynchronous
  o The programmer must synchronize when needed
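A sketch of this point, reusing the context and device of the earlier examples; buf_a/buf_b, host_a/host_b and memsize are hypothetical buffers, host arrays and size:

cl_command_queue q0 = clCreateCommandQueue(my_context, devices[0], 0, &err);
cl_command_queue q1 = clCreateCommandQueue(my_context, devices[0], 0, &err);

// the two transfers are independent and may overlap
clEnqueueWriteBuffer(q0, buf_a, CL_FALSE, 0, memsize, host_a, 0, NULL, NULL);
clEnqueueWriteBuffer(q1, buf_b, CL_FALSE, 0, memsize, host_b, 0, NULL, NULL);

// explicit synchronization, one queue at a time
clFinish(q0);
clFinish(q1);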

How to Allocate Memory on a Device?

• Memory objects are categorized into two types
  o Buffer objects: 1D memory
  o Image objects: 2D-3D memory

• Buffer contents can be
  o A scalar data type
  o A vector data type
  o A user-defined structure

• Memory objects are described by a cl_mem object

• Kernels take cl_mem objects as input or output

Allocate 1D Memory

• Create a buffer

cl_mem clCreateBuffer(cl_context context,
                      cl_mem_flags flags,
                      size_t size_in_bytes,
                      void *host_ptr,
                      cl_int *errcode_ret)

• Example
  o Allocate a single precision float matrix as input

size_t memsize = nb_elements * sizeof(float);
cl_mem mat_a_gpu = clCreateBuffer(my_context, CL_MEM_READ_ONLY, memsize, NULL, &err);

o And as output

cl_mem mat_res_gpu = clCreateBuffer(my_context, CL_MEM_WRITE_ONLY, memsize, NULL, &err);

Transfer Data to Device (1)

• Any data transfer to/from the device implies
  o A host pointer
  o A device memory object
  o The size of data (in bytes)
  o The command queue
  o Whether the transfer is blocking or not
  o …

• In case of a non-blocking transfer
  o Link a cl_event to the transfer
  o Check that the transfer has finished with:

cl_int clWaitForEvents (cl_uint num_events, const cl_event *event_list)
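A sketch of a non-blocking write combined with clWaitForEvents, reusing the mat_a / mat_a_gpu / memsize objects of the surrounding examples:

cl_event write_evt;

// CL_FALSE: the call returns immediately, the transfer runs in the background
err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_FALSE, 0, memsize,
                           (void*) mat_a, 0, NULL, &write_evt);

// ... do independent host work here ...

// block until the transfer has completed
err = clWaitForEvents(1, &write_evt);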

Transfer Data to Device (2)

• Write in a buffer

cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                            cl_mem buffer,
                            cl_bool blocking_write,
                            size_t offset,
                            size_t size_in_bytes,
                            const void *ptr,
                            cl_uint num_events_in_wait_list,
                            const cl_event *event_wait_list,
                            cl_event *event)

• Example
  o Transferring mat_a synchronously

err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_TRUE, 0, memsize,
                           (void*) mat_a, 0, NULL, &evt);

Kernel Arguments

• A kernel needs arguments
  o So we must set these arguments
  o Arguments can be scalar, vector or user-defined data types

• Set the kernel arguments

cl_int clSetKernelArg( cl_kernel kernel, cl_uint arg_index, size_t arg_size, const void *arg_value)

• Example
  o Set the a argument
  o Set the res argument
  o Set the size argument

err = clSetKernelArg(my_kernel, 0, sizeof(cl_mem), (void*) &mat_a_gpu);
err = clSetKernelArg(my_kernel, 1, sizeof(cl_mem), (void*) &mat_res_gpu);
err = clSetKernelArg(my_kernel, 2, sizeof(int), (void*) &nb_elements);

Settings for Kernel Launching

• Set the NDRange (grid) geometry

size_t global_work_size[2] = {nb_elements_x, nb_elements_y};
size_t local_work_size[2]  = {16, 16};

• Task parallel model is used for CPU o General task : complex, independent, …

• Data-parallel is used for GPU o Need to set a grid, NDRange in OpenCL
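Since the global size must be compatible with the work-group size, a common pattern (a sketch reusing the sizes declared above) is to round each dimension up to a multiple of the local size:

// round each dimension up to a multiple of the 16x16 work-group size
global_work_size[0] = ((nb_elements_x + 15) / 16) * 16;
global_work_size[1] = ((nb_elements_y + 15) / 16) * 16;
// the kernel then tests get_global_id() against the real problem size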

Kernel Launch (1)

• If task kernel o Use the queued task command

cl_int clEnqueueTask(cl_command_queue command_queue,
                     cl_kernel kernel,
                     cl_uint num_events_in_wait_list,
                     const cl_event *event_wait_list,
                     cl_event *event)

• If data-parallel kernel o Use the queued NDRange kernel command

cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                              cl_kernel kernel,
                              cl_uint work_dim,
                              const size_t *global_work_offset,
                              const size_t *global_work_size,
                              const size_t *local_work_size,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event)

Kernel Launch (2)

• The launch of the kernel is asynchronous by default

err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL,
                             &global_work_size[0], &local_work_size[0],
                             0, NULL, &evt);
clFinish(cmd_queue);
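The same event can also provide timing information, provided the command queue was created with the CL_QUEUE_PROFILING_ENABLE property (not set in the earlier queue-creation example); a hedged sketch:

cl_ulong t_start, t_end;

clWaitForEvents(1, &evt);   // make sure the kernel has completed
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                        sizeof(cl_ulong), &t_start, NULL);
clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                        sizeof(cl_ulong), &t_end, NULL);

// elapsed device time in nanoseconds
cl_ulong elapsed_ns = t_end - t_start;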

Copy Back the Results

• About the same as the copy in o From device to host

cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                           cl_mem buffer,
                           cl_bool blocking_read,
                           size_t offset,
                           size_t size_in_bytes,
                           void *ptr,
                           cl_uint num_events_in_wait_list,
                           const cl_event *event_wait_list,
                           cl_event *event)

• Example

err = clEnqueueReadBuffer(cmd_queue, mat_res_gpu, CL_TRUE, 0, memsize,
                          (void*) mat_res, 0, NULL, NULL);
clFinish(cmd_queue);

Free Device's Memory

• At the end you need to release the allocated memory

• Releasing the memory

cl_int clReleaseMemObject(cl_mem memobj)

• Example o Release matrix a on GPU o Release matrix res on GPU

clReleaseMemObject(mat_a_gpu);
clReleaseMemObject(mat_res_gpu);

Release Objects

• At the end, you must release
  o The programs
  o The kernels
  o The command queues
  o And the context

cl_int clReleaseKernel (cl_kernel kernel)

cl_int clReleaseProgram (cl_program program)

cl_int clReleaseCommandQueue (cl_command_queue command_queue)

cl_int clReleaseContext (cl_context context)

Error Management

• Do not forget to manage errors o Example with the copy back

// read output image
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                          n * sizeof(cl_float), dst, 0, NULL, NULL);

if (err != CL_SUCCESS) {
    delete_memobjs(memobjs, 3);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(cmd_queue);
    clReleaseContext(context);
    return -1;
}
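Since nearly every call returns a cl_int status, a small helper such as the hypothetical CL_CHECK macro below (requires <stdio.h> and <stdlib.h>) keeps the checks readable; a sketch:

#define CL_CHECK(call)                                          \
    do {                                                        \
        cl_int _status = (call);                                \
        if (_status != CL_SUCCESS) {                            \
            fprintf(stderr, "OpenCL error %d at %s:%d\n",       \
                    _status, __FILE__, __LINE__);               \
            exit(EXIT_FAILURE);                                 \
        }                                                       \
    } while (0)

// usage
CL_CHECK(clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
                             n * sizeof(cl_float), dst, 0, NULL, NULL));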

OpenCL Interesting Characteristics

• Parallel programming model
  o Simple and massively parallel (for node programming)
  o Helps to exploit memory bandwidth via vector accesses
  o Performance is kind of "predictable" (i.e. similar to MPI, which has explicit communication)

• Fine grain control over resources
  o Detailed API, JIT compilation

• Available on most platforms
  o NVIDIA/AMD/ARM GPUs
  o Intel/AMD/… CPUs
  o Intel MIC

OpenACC Concepts

• Express data and computations to be executed on an accelerator o Using marked code regions

• Main OpenACC constructs
  o Kernel regions
  o Loops
  o Data regions
  (Data/stream/vector parallelism to be exploited by the HWA, e.g. via CUDA / OpenCL)

• Runtime API

(Diagram: CPU and HWA linked with a PCIx bus)
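For the runtime API part, a minimal sketch of device query and initialization (functions from the OpenACC runtime header openacc.h; targeting an NVIDIA device is an assumption):

#include <openacc.h>

int ndev = acc_get_num_devices(acc_device_nvidia);   // how many accelerators?
acc_set_device_num(0, acc_device_nvidia);            // select device 0
acc_init(acc_device_nvidia);                         // initialize the runtime/device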

OpenACC Kernels Regions

• Identifies sections of code to be executed on the accelerator

• Parallel loops inside a kernel region are turned into accelerator kernels
  o Such as CUDA or OpenCL kernels
  o Different loop nests may have different gridifications

#pragma acc kernels
{
  for (int i = 0; i < n; ++i){
    for (int j = 0; j < n; ++j){
      for (int k = 0; k < n; ++k){
        B[i][j*k%n] = A[i][j*k%n];
      }
    }
  }
  ...
  for (int i = 0; i < n; ++i){
    for (int j = 0; j < m; ++j){
      B[i][j] = i * j * A[i][j];
    }
  }
  ...
}

OpenACC Kernel Execution Model

• Based on three parallelism levels
  o Gangs – coarse grain
  o Workers – fine grain
  o Vectors – finest grain

• Mapping on the physical architecture is compiler dependent

(Diagram: a device executes gangs; each gang contains workers; each worker operates on vector lanes.)
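A hedged sketch of how the three levels can be requested explicitly on a loop nest (the gang/worker/vector clauses are standard OpenACC; how they map to the hardware remains compiler dependent):

#pragma acc kernels
{
  #pragma acc loop gang          // coarse grain: iterations spread over gangs
  for (int i = 0; i < n; ++i) {
    #pragma acc loop worker      // fine grain: workers inside a gang
    for (int j = 0; j < m; ++j) {
      #pragma acc loop vector    // finest grain: vector/SIMD lanes
      for (int k = 0; k < p; ++k) {
        B[i][j][k] = 2.0f * A[i][j][k];
      }
    }
  }
}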

OpenACC Loop Independent Directive

• Inserted inside Kernels regions

• Describes that a loop is data-independent

• Other clauses can declare loop-private variables or arrays, and reduction operations

#pragma acc loop independent
for (int i = 0; i < n; ++i){      // iterations of variable i are data independent
  B[i][0] = 1;
  for (int j = 1; j < m; ++j){    // iterations of variable j are not data independent
    B[i][j] = i * j * A[i][j-1];
  }
}

OpenACC Data Regions

• Data regions define scalars, arrays and sub-arrays
  o To be allocated in the device memory for the duration of the region
  o To be explicitly managed using transfer clauses or directives

• Optimizing transfers consists in:
  o Transferring data
    • From CPU to GPU when entering a data region
    • From GPU to CPU when exiting a data region
  o Launching several kernels
    • That can reuse the data inside this data region

• Kernels regions implicitly define data regions

Data Management Directives Example

#pragma acc data copyin(A[1:N-2]), copyout(B[N])
{
  #pragma acc kernels
  {
    #pragma acc loop independent
    for (int i = 0; i < N; ++i){
      A[i][0] = ...;
      A[i][M - 1] = 0.0f;
    }
    ...
  }

  #pragma acc update host(A)
  ...

  #pragma acc kernels
  for (int i = 0; i < n; ++i){
    B[i] = ...;
  }
}

CAPS OpenACC Compiler Flow

(Compiler flow diagram: C, C++ and Fortran frontends feed an extraction module that separates the host code from the codelets (Fun#1, Fun#2, Fun#3); an instrumentation module handles the host code while CUDA and OpenCL code generation handle the codelets; the CPU compiler (gcc, ifort, …) and the CUDA/OpenCL compilers then produce the executable (mybin.exe), linked with the CAPS runtime, and the HWA code as a dynamic library.)

OpenACC Interesting Characteristics

• Talks about data structures and multiple address spaces
  o Does not assume a coherent unique shared address space

• Simple parallel model
  o Parallel nested loops

• Easy to extend
  o Directives or clauses can be goal specific, does not modify the base language

• Available on many platforms
  o Via the use of OpenCL as a compiler target

• Many open source projects
  o Many of them based on LLVM

• Should be very close to the OpenMP Accelerator Extension
  o Worst case, easy to move OpenACC code to the OpenMP accelerator extension (see the sketch below)
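As a rough illustration of the last point, the sketch below places an OpenACC kernels loop next to one possible OpenMP accelerator-extension counterpart; the mapping is an assumption about how a port could look, not a defined equivalence:

/* OpenACC */
#pragma acc data copyin(a[0:n]) copyout(b[0:n])
{
  #pragma acc kernels loop
  for (int i = 0; i < n; ++i)
    b[i] = 2.0f * a[i];
}

/* OpenMP accelerator extension (approximate counterpart) */
#pragma omp target map(to: a[0:n]) map(from: b[0:n])
#pragma omp teams distribute parallel for
for (int i = 0; i < n; ++i)
  b[i] = 2.0f * a[i];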

Exascale Programming Systems Technological Challenges

1. Parallel programming

2. Runtime support/systems

3. Debugging and correctness

4. High performance libraries and components

5. Performance tools

6. Tools infrastructure

7. Cross cutting issues

(Topics extracted from the ETP4HPC SRA programming environment chapter, http://www.etp4hpc.eu. Several of these challenges are discussed in the remainder of this presentation.)

Parallel Programming APIs Research Topics

• Domain specific languages
• API for legacy codes
• MPI + X approaches
• Partitioned Global Address Space (PGAS) languages and APIs
• Dealing with hardware heterogeneity
• Data oriented approaches and languages
• Auto-tuning API
• Asynchronous programming models and languages

API for Legacy Codes

• OpenACC
  o Directive-based approach, particularly suited to legacy codes
  o Focused on the heterogeneous node
  o Not C only: also targets Fortran and C++

• OpenCL
  o Not that convenient for legacy codes
  o Complex to mix with OpenMP
  o Can be used to unify multithreading

MPI + X Approaches

• OpenACC
  o Complementary to MPI
  o Complex to mix with OpenMP, i.e. balancing the load over the CPUs and accelerators
  o OpenACC can deal with thread and accelerator parallelism, but its expression of parallelism does not fit all applications

• OpenCL o idem
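A common way to combine MPI with OpenACC is to bind one accelerator per rank using the runtime API; a minimal sketch (the round-robin rank-to-device mapping assumes ranks are placed contiguously on each node):

#include <mpi.h>
#include <openacc.h>

int rank;
MPI_Comm_rank(MPI_COMM_WORLD, &rank);

// each rank picks one of the node's accelerators
int ndev = acc_get_num_devices(acc_device_nvidia);
if (ndev > 0)
    acc_set_device_num(rank % ndev, acc_device_nvidia);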

Dealing with Hardware Heterogeneity

• OpenACC
  o Designed for this
  o May simplify code tuning
  o No automatic load balancing over the heterogeneous units; needs to be extended
  o Better understanding of the code by the compiler (e.g. exposed data management, parallel nested loops)
    • Provides restructuring capabilities
  o May be extended to consider non-volatile memories (NVM)
  o Does not consider multiple accelerators
    • Extension to come

• OpenCL
  o Designed for this
  o Code tuning exposes many low-level details
  o Detailed API for resource management
    • Gives a lot of control to users
    • Programming may be complex
  o Interesting parallel model to help vectorization

Auto-tuning API

• Targeting performance portability issues

• What would an auto-tuning API provide?
  o Decision point description
    • e.g. callsite
  o Variants description
    • Abstract syntax trees
    • Execution constraints (e.g. specialized codelets)
  o Execution context
    • Parameter values
    • Hardware target description and allocation
  o Runtime control to select variants or drive runtime code generation

• OpenACC
  o OpenACC gives more opportunity to compilers/automatic tools
  o Can be extended to provide a standard API
  o Many tuning techniques over parallel loops
  o Orthogonal to programming

• OpenCL
  o Can integrate auto-tuning but may be limited in scope
  o OpenCL is low level; guessing high-level properties is difficult

Asynchronous Programming Models and Languages

• OpenACC
  o Limited asynchronous capabilities, constrained by the host-accelerator model
  o Not suited for data-flow approaches; needs to be extended (the OpenHMPP codelet concept is more suitable for this)

• OpenCL o idem

Debugging and Correctness Research Topics

1. Debugging heterogeneous/hybrid code
2. Static debugging
3. Dynamic debugging
4. Debugging highly asynchronous parallel code at full (Peta-, Exa-)scale
5. Runtime and debugger integration
6. Model-aware debugging
7. Automatic techniques

Debugging Heterogeneous/Hybrid Code

• OpenACC
  o Preserves most of the serial semantics, which helps a lot to design debugging tools
  o Helps to distinguish serial bugs from parallel ones
  o Programming can be very incremental, simplifying debugging

• OpenCL
  o Complex debugging due to many low-level details and the parallel/memory model

High Performance Libraries and Components Research Topics

1. Application componentization
2. Templates/skeleton/component-based approaches and languages
3. Components / library interoperability
4. Self- / auto-tuning libraries and components
5. New parallel algorithms / parallelization paradigms; e.g. resilient algorithms

Templates/Skeleton/Component-Based Approaches and Languages

• OpenACC
  o Can be used to write libraries; can exploit already allocated data/HW (pcopy clause)
  o If extended with tuning directives such as hmppcg (e.g. loop transformations), can be used to express templates:
    • Templates to express static code transformations
    • Use runtime techniques to tune dynamic parameters such as the number of gangs, workers and vector sizes

• OpenCL
  o Used a lot to write libraries
  o Fits well with C++ component development

Components / Library Interoperability

• Library calls can usually only be partially replaced
  o We want a unique source code even when using accelerated libraries; the CPU version is the reference point
  o No one-to-one mapping between libraries (e.g. BLAS vs. CuBLAS, FFTW vs. CuFFT)
  o No access to all application code (i.e. need to keep the CPU library)

• Deal with multiple address spaces / multi-HWA
  o Data location may not be unique (copies, mirrors)
  o Usual library calls assume a unique data location
  o Library efficiency depends on updated data location (long-term effect)

• OpenACC
  o Needs to interact with user code; currently limited to sharing the device data pointer
  o Missing automatic data management/allocation (e.g. StarPU) to deal with computation migrations (needed to adapt to hardware resources and compute load)

• OpenCL
  o OpenACC and OpenCL have to interact efficiently
  o The API can easily be normalized thanks to the standardization initiative
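For the device-pointer sharing mentioned in the OpenACC item above, the host_data construct passes the device address of data already managed by a data region to an external accelerated library; a sketch, with cublasSaxpy used only as a stand-in for any such library entry point:

// x and y are present on the device for the duration of the data region
#pragma acc data copyin(x[0:n]) copy(y[0:n])
{
  #pragma acc host_data use_device(x, y)
  {
    // the library call receives device pointers, not host pointers
    cublasSaxpy(n, 2.0f, x, 1, y, 1);
  }
}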

Library Interoperability in HMPP 3.0

(Diagram: application code calls libRoutine1/2/3; with the #pragma hmppalt directive, the HMPP runtime routes a call either to a proxy targeting a native GPU library (C/CUDA/OpenCL) or to the original CPU library.)

Self- / Auto-tuning Libraries and Components

• OpenACC
  o Already provides dynamic parameters for code tuning (e.g. number of workers)
  o Needs to be extended to allow code template/skeleton descriptions

• OpenCL
  o Maybe not the right level, a bit too low level
  o Except for vectorization techniques

(Diagram: the HMPP compiler generates several codelet variants; a variant is selected dynamically, driven by execution feedback.)

[Cross cutting issues]

1. Standardization initiative
2. Fault tolerance at programming level
3. Programming energy consumption control
4. Tools interfaces and public APIs
5. Intellectual property issues
6. Performance portability issues
7. Software engineering, applications and users expectations
8. Tools development strategy
9. Validation: benchmarks and other mini-apps
10. Co-design (hardware - software; applications - programming environment)

Fault Tolerance at Programming Level

• OpenACC
  o OpenACC data regions can be extended to mark structures for specific fault-tolerance management
  o Extension of the memory model for NVM, etc.

• OpenCL
  o Data management via the API makes it difficult for static tools (e.g. compilers, analyzers)

Validation: Benchmarks and Other Mini-apps

• OpenACC
  o Extremely important to have good measurements of exascale potential
  o Kernels are not enough
  o Tools are usually designed to match benchmark requirements
    • Very influential on the output
  o Mini-apps (e.g. Hydro/PRACE, Mantevo) are a pragmatic and efficient approach
    • But extremely expensive to design
    • Must be production quality
    • Need to exhibit extremely scalable algorithms
  o On the critical path for the foundation of an ecosystem

• OpenCL
  o Idem
  o Limited to C

Conclusion

• OpenACC provides an interesting framework for designing an Exascale, non-revolutionary, programming environment
  o Leverages existing academic and industrial initiatives
  o May be used as a basic infrastructure for higher-level approaches
  o Mixable with MPI, PGAS, …
  o Available on many hardware targets

• OpenCL is very complementary as a basic device programming layer

• Need to mix high-level and low-level approaches in a consistent manner
  o Inside nodes, OpenACC/OpenMP-AE combined with OpenCL is at least worth a try as a basis for studying an exascale programming environment
  o Complementary with more revolutionary approaches


http://www.caps-entreprise.com
http://twitter.com/CAPSentreprise
http://www.openacc-standard.org/
http://www.openhmpp.org