
OpenCL – OpenACC & Exascale
F. Bodin
WSTOOLS 2012 — www.caps-entreprise.com

Introduction
• Exascale architectures may be
  o Massively parallel
  o Heterogeneous in compute units
  o Hierarchical in memory systems
  o Unreliable
  o Asynchronous
  o Strongly oriented toward energy saving
  o …
• The exascale roadmap needs to be built on programming standards
  o Nobody can afford to rewrite applications again and again
  o The exascale roadmap, HPC, mass-market many-core and embedded systems share many common issues
  o Exascale is not about a heroic technology development
  o Exascale projects must provide technology for a large industry base and user community
• OpenACC and OpenCL may be candidates
  o They deal with parallelism inside the node
  o Both are part of a standardization initiative
  o OpenACC is complementary to OpenCL
• This presentation tries to forecast OpenACC and OpenCL in the light of exascale challenges
  o Challenges as identified by the ETP4HPC (http://www.etp4hpc.eu)

Programming Environments Context
1. Standardization initiative
  o Software developments need visibility
2. Intellectual property issues
  o Impact on tools development and interactions
  o Fundamental for creating an ecosystem
  o Making everything open source is not the answer
3. Software engineering, applications and user expectations
  o Minimizing maintenance effort: one code for all targets
4. Tools development strategy
  o How to create a coherent ecosystem?

Outline of the Presentation
• A very short overview of OpenCL
• A very short overview of OpenACC
• OpenACC-OpenCL versus exascale challenges

OpenCL Overview
• Open Computing Language
  o C-based cross-platform programming interface
  o Subset of ISO C99 with language extensions
  o Data- and task-parallel compute model
• Host / compute devices (GPUs) model
• Platform layer API and runtime API
  o Hardware abstraction layer, …
  o Manage resources
• Portable syntax

Memory Model
• Four distinct memory regions
  o Global memory
  o Local memory
  o Constant memory
  o Private memory
• Global and constant memories are common to all work-items (WI)
  o May be cached depending on the hardware capabilities
• Local memory is shared by all work-items of a work-group (WG)
• Private memory is private to each work-item

OpenCL Memory Hierarchy
[Figure: OpenCL memory hierarchy diagram, from Aaftab Munshi's talk at SIGGRAPH 2008]

Data-Parallelism in OpenCL
• A kernel is executed by the work-items
  o Same parallel model as CUDA (< 4.x)

  // OpenCL kernel function for element-by-element vector addition
  __kernel void VectorAdd(__global const float8* a,
                          __global const float8* b,
                          __global float8* c)
  {
      // get oct-float index into the global data array
      int iGID = get_global_id(0);
      // read inputs into registers
      float8 f8InA = a[iGID];
      float8 f8InB = b[iGID];
      float8 f8Out = (float8)0.0f;
      // add the vector elements
      f8Out.s0 = f8InA.s0 + f8InB.s0;
      f8Out.s1 = f8InA.s1 + f8InB.s1;
      f8Out.s2 = f8InA.s2 + f8InB.s2;
      f8Out.s3 = f8InA.s3 + f8InB.s3;
      f8Out.s4 = f8InA.s4 + f8InB.s4;
      f8Out.s5 = f8InA.s5 + f8InB.s5;
      f8Out.s6 = f8InA.s6 + f8InB.s6;
      f8Out.s7 = f8InA.s7 + f8InB.s7;
      // write the result back out to global memory
      c[iGID] = f8Out;
  }
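The slides go straight from the kernel source to runtime operations. As a bridge, here is a minimal host-side sketch, not from the original slides, of how such a kernel is typically compiled and created at runtime; it assumes an already-created context my_context and device array devices, matching the names used in the later examples.

  // Sketch (assumed setup): compile the VectorAdd kernel source at runtime.
  // `my_context` and `devices` are assumed to exist already, created with
  // clCreateContext / clGetDeviceIDs (not shown in the slides).
  const char *src = "/* the VectorAdd kernel source shown above */";
  cl_int err;
  cl_program program = clCreateProgramWithSource(my_context, 1, &src, NULL, &err);
  // Build for the first device, with no compiler options and no callback
  err = clBuildProgram(program, 1, &devices[0], NULL, NULL, NULL);
  // Create the kernel object from its function name
  cl_kernel my_kernel = clCreateKernel(program, "VectorAdd", &err);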
OpenCL vs CUDA
• OpenCL and CUDA share the same parallel programming model

  OPENCL          CUDA
  kernel          kernel
  host pgm        host pgm
  NDRange         grid
  work-item       thread
  work-group      block
  global mem      global mem
  constant mem    constant mem
  local mem       shared mem
  private mem     local mem

• The runtime APIs are different
  o OpenCL is lower level than CUDA
• OpenCL and CUDA may use different implementations, which can lead to different execution times for a similar kernel on the same hardware

Basic OpenCL Runtime Operations
• Create a command queue
• Then queue up OpenCL events
  o Data transfers
  o Kernel launches
• Allocate the accelerator's memory
  o Before transferring data
• Free the memory
• Manage errors

The Command Queue (1)
• A command queue can be used to queue a set of operations
• Having multiple command queues allows applications to queue multiple independent commands without requiring synchronization
• Create an OpenCL command queue:

  cl_command_queue clCreateCommandQueue(cl_context context,
                                        cl_device_id device,
                                        cl_command_queue_properties properties,
                                        cl_int *errcode_ret)

The Command Queue (2)
• Example: allocation of a queue for device 0

  cl_command_queue my_cmd_queue;
  my_cmd_queue = clCreateCommandQueue(my_context, devices[0], 0, NULL);

• Flush a command queue (all commands have started):

  cl_int clFlush(cl_command_queue command_queue)

• Finish a command queue, i.e. synchronize (all commands have terminated):

  cl_int clFinish(cl_command_queue command_queue)

The Command Queue (3)
• It is possible to have multiple command queues on a device
  o Command queues are asynchronous
  o The programmer must synchronize when needed

How to Allocate Memory on a Device?
• Memory objects fall into two types
  o Buffer objects: 1D memory
  o Image objects: 2D-3D memory
• The contents can be
  o A scalar data type
  o A vector data type
  o A user-defined structure
• Memory objects are described by a cl_mem object
• Kernels take cl_mem objects as input or output

Allocate 1D Memory
• Create a buffer:

  cl_mem clCreateBuffer(cl_context context,
                        cl_mem_flags flags,
                        size_t size_in_bytes,
                        void *host_ptr,
                        cl_int *errcode_ret)

• Example: allocate a single-precision float matrix as input

  size_t memsize = nb_elements * sizeof(float);
  cl_mem mat_a_gpu = clCreateBuffer(my_context, CL_MEM_READ_ONLY,
                                    memsize, NULL, &err);

  and as output:

  cl_mem mat_res_gpu = clCreateBuffer(my_context, CL_MEM_WRITE_ONLY,
                                      memsize, NULL, &err);

Transfer Data to Device (1)
• Any data transfer to/from the device requires
  o A host pointer
  o A device memory object
  o The size of the data (in bytes)
  o The command queue
  o Whether the transfer is blocking or not
  o …
• For a non-blocking transfer
  o Link a cl_event to the transfer
  o Check that the transfer has finished with:

  cl_int clWaitForEvents(cl_uint num_events, const cl_event *event_list)

Transfer Data to Device (2)
• Write to a buffer:

  cl_int clEnqueueWriteBuffer(cl_command_queue command_queue,
                              cl_mem buffer,
                              cl_bool blocking_write,
                              size_t offset,
                              size_t size_in_bytes,
                              const void *ptr,
                              cl_uint num_events_in_wait_list,
                              const cl_event *event_wait_list,
                              cl_event *event)

• Example: transferring mat_a synchronously

  err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_TRUE, 0, memsize,
                             (void*) mat_a, 0, NULL, &evt);
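To make the non-blocking case from Transfer Data (1) concrete, here is a short sketch, not in the original slides, of an asynchronous write followed by an explicit event wait; cmd_queue, mat_a_gpu, memsize and mat_a reuse the names from the examples above.

  // Sketch of a non-blocking transfer: CL_FALSE makes the call return
  // immediately, so the host can do other work before waiting on the event.
  cl_event evt;
  err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_FALSE, 0, memsize,
                             (void*) mat_a, 0, NULL, &evt);
  /* ... overlap independent host work here ... */
  err = clWaitForEvents(1, &evt);   // block until the transfer has finished
  clReleaseEvent(evt);              // events must be released as well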
Kernel Arguments
• A kernel needs arguments
  o So we must set these arguments
  o Arguments can be scalar, vector or user-defined data types
• Set a kernel argument:

  cl_int clSetKernelArg(cl_kernel kernel,
                        cl_uint arg_index,
                        size_t arg_size,
                        const void *arg_value)

• Example: set the a, res and size arguments

  err = clSetKernelArg(my_kernel, 0, sizeof(cl_mem), (void*) &mat_a_gpu);
  err = clSetKernelArg(my_kernel, 1, sizeof(cl_mem), (void*) &mat_res_gpu);
  err = clSetKernelArg(my_kernel, 2, sizeof(int), (void*) &nb_elements);

Settings for Kernel Launching
• Set the NDRange (grid) geometry, for example:

  size_t global_work_size[2] = {nb_elements_x, nb_elements_y};
  size_t local_work_size[2] = {16, 16};

• The task-parallel model is used for CPUs
  o General tasks: complex, independent, …
• The data-parallel model is used for GPUs
  o Needs a grid, called an NDRange in OpenCL

Kernel Launch (1)
• For a task kernel, use the enqueue-task command:

  cl_int clEnqueueTask(cl_command_queue command_queue,
                       cl_kernel kernel,
                       cl_uint num_events_in_wait_list,
                       const cl_event *event_wait_list,
                       cl_event *event)

• For a data-parallel kernel, use the enqueue-NDRange-kernel command:

  cl_int clEnqueueNDRangeKernel(cl_command_queue command_queue,
                                cl_kernel kernel,
                                cl_uint work_dim,
                                const size_t *global_work_offset,
                                const size_t *global_work_size,
                                const size_t *local_work_size,
                                cl_uint num_events_in_wait_list,
                                const cl_event *event_wait_list,
                                cl_event *event)

Kernel Launch (2)
• The launch of the kernel is asynchronous by default

  err = clEnqueueNDRangeKernel(cmd_queue, kernel[0], 2, NULL,
                               &global_work_size[0], &local_work_size[0],
                               0, NULL, &evt);
  clFinish(cmd_queue);

Copy Back the Results
• Much the same as the copy in, but from device to host:

  cl_int clEnqueueReadBuffer(cl_command_queue command_queue,
                             cl_mem buffer,
                             cl_bool blocking_read,
                             size_t offset,
                             size_t size_in_bytes,
                             void *ptr,
                             cl_uint num_events_in_wait_list,
                             const cl_event *event_wait_list,
                             cl_event *event)

• Example

  err = clEnqueueReadBuffer(cmd_queue, res_mem, CL_TRUE, 0, size,
                            (void*) mat_res, 0, NULL, NULL);
  clFinish(cmd_queue);

Free the Device's Memory
• At the end you need to release the allocated memory:

  cl_int clReleaseMemObject(cl_mem memobj)

• Example: release matrices a and res on the GPU

  clReleaseMemObject(mat_a_gpu);
  clReleaseMemObject(mat_res_gpu);

Release Objects
• At the end, you must release
  o The programs
  o The kernels
  o The command queues
  o And the context

  cl_int clReleaseKernel(cl_kernel kernel)
  cl_int clReleaseProgram(cl_program program)
  cl_int clReleaseCommandQueue(cl_command_queue command_queue)
  cl_int clReleaseContext(cl_context context)

Error Management
• Do not forget to check the error codes returned by the OpenCL calls (see the sketch below)
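As a concrete instance of this advice, here is an illustrative sketch of systematic error checking; the CHECK macro is our own invention for illustration, not an OpenCL or CAPS API, and the calls reuse names from the earlier examples.

  #include <stdio.h>
  #include <stdlib.h>

  // Illustrative error-checking macro (not from the slides): abort with
  // a diagnostic whenever a call does not return CL_SUCCESS.
  #define CHECK(err)                                          \
      do {                                                    \
          if ((err) != CL_SUCCESS) {                          \
              fprintf(stderr, "OpenCL error %d at %s:%d\n",   \
                      (int)(err), __FILE__, __LINE__);        \
              exit(EXIT_FAILURE);                             \
          }                                                   \
      } while (0)

  // Usage with calls from the previous slides:
  err = clEnqueueWriteBuffer(cmd_queue, mat_a_gpu, CL_TRUE, 0, memsize,
                             (void*) mat_a, 0, NULL, NULL);
  CHECK(err);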