A Brief Introduction to OpenCL • Reference - Programming Massively Parallel Processors: A Hands-on Approach, David Kirk and Wen-mei W. Hwu, Chapter 11 • What is OpenCL? – It is a standardized, cross-platform, parallel-computing API – It is designed to be portable and to work with heterogeneous systems – Unlike earlier models such as OpenMP, OpenCL is designed to address complex memory hierarchies and SIMD computing – A more general standard is also more complex; not every device supports every feature, so it may be necessary to write adaptable code – In this brief introduction we will look at the data parallelism model and briefly see its application to the molecular visualization problem

Data Parallelism Model • There is a direct correspondence with CUDA • Host programs are used to launch kernels on OpenCL devices • The index space maps data to the work items • Work items are organized into work groups, as threads are organized into blocks in CUDA • Work items in the same group can be synchronized • The next slide shows a 2D NDRange (or index space); it is very similar to the CUDA model (except the work group indices are in the expected order!)

Parallel Execution Model
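As a rough illustration of how an NDRange is partitioned into work groups, the plain-C sketch below splits a global index into a work-group index and a local index for one dimension. It does not call OpenCL; the names `nd_coord`, `split_global_id`, and `num_groups` are hypothetical, and the sketch assumes the global size is a multiple of the work-group size and that no global offset is used.

```c
#include <assert.h>
#include <stddef.h>

/* Position of one work item along one dimension of an NDRange
   (hypothetical helper type, not part of the OpenCL API). */
typedef struct {
    size_t group_id;  /* which work group the item falls in */
    size_t local_id;  /* position inside that work group    */
} nd_coord;

/* Split a global index into (group, local) for one dimension.
   Assumes no global offset. */
nd_coord split_global_id(size_t global_id, size_t local_size)
{
    assert(local_size > 0);
    nd_coord c = { global_id / local_size, global_id % local_size };
    return c;
}

/* Number of work groups along one dimension.
   Assumes the global size is a multiple of the work-group size. */
size_t num_groups(size_t global_size, size_t local_size)
{
    assert(local_size > 0 && global_size % local_size == 0);
    return global_size / local_size;
}
```

For a 2D NDRange the same split is applied once per dimension: with a global size of 8 x 4 and a work-group size of 4 x 2, the work item at global position (6, 3) falls in work group (1, 1) at local position (2, 1).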

Getting global and local values • Thread IDs and Sizes – The API calls and equivalent CUDA code are shown below for dimension 0 (the x dimension) – If the parameter is 1, it corresponds to the y dimension, and 2 to the z dimension
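The correspondence for dimension 0 can be sketched in plain C. The block below does not call OpenCL or CUDA; the struct `work_item_1d` and the `emu_` functions are hypothetical stand-ins that mirror the usual mapping between the OpenCL work-item functions and the CUDA built-in variables, assuming no global offset.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the state of one work item in dimension 0.
   Each field is annotated with the OpenCL call and its CUDA counterpart. */
typedef struct {
    size_t group_id;    /* get_group_id(0)   ~ blockIdx.x  */
    size_t local_id;    /* get_local_id(0)   ~ threadIdx.x */
    size_t local_size;  /* get_local_size(0) ~ blockDim.x  */
    size_t num_groups;  /* get_num_groups(0) ~ gridDim.x   */
} work_item_1d;

/* get_global_id(0) ~ blockIdx.x * blockDim.x + threadIdx.x
   (assuming no global offset) */
size_t emu_global_id(work_item_1d w)
{
    assert(w.local_id < w.local_size);
    return w.group_id * w.local_size + w.local_id;
}

/* get_global_size(0) ~ gridDim.x * blockDim.x */
size_t emu_global_size(work_item_1d w)
{
    return w.num_groups * w.local_size;
}
```

For example, work item 5 of work group 2, with 64 work items per group and 4 groups, has global ID 2 * 64 + 5 = 133 in a global range of 256.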

Device Architecture • The host is a traditional CPU that executes the host program • Here is an OpenCL device – It contains one or more compute units (CUs) – Each CU contains one or more processing elements (PEs) – There are a variety of memory types

Memory Characteristics • Global memory – dynamically allocated by the host; read/write access by both the host and the devices • Constant memory – dynamically allocated by the host (unlike CUDA); read/write by the host and read-only by the device; a query returns the maximum size supported by the device • Local memory – most closely corresponds to CUDA shared memory; can be dynamically allocated by the host and statically allocated by the device; cannot be accessed by the host (same as CUDA) but can be accessed by all work items in the work group • Private memory – corresponds to CUDA local memory

Kernel Functions • Similarities with CUDA – __kernel corresponds to __global__ in CUDA – A vector add kernel is shown below; it has two input vectors, a and b, and one output vector, result – All three vectors reside in global memory; the two inputs are const – This is a 1D problem, so get_global_id(0) is used to get the index – The addition is performed as expected

Device Management & Kernel Launch • Now for the “ugly” side of OpenCL – CUDA, which deals with a uniform device from one manufacturer, has hidden the details of launching kernels – This is not possible in OpenCL, which is designed for many widely varied devices from many manufacturers • An OpenCL context – Use clCreateContext() – Use clGetDeviceIDs() to find all devices – Create a command queue for each device – A sequence of function calls is made to insert the kernel code with its execution parameters

A “Simple” Example - 1 • Line by Line – Line 1 sets the error code to success – Line 2 calls create context from type • Param 2 includes all devices • The last argument receives the error code – Line 3 declares parmsz, which will hold the size of the memory buffer – Line 4 is the first call to clGetContextInfo • clctx from line 2 is the first param • Param 4 is NULL since the buffer size is not yet known • There will be another call in line 6 where the missing information is supplied

A “Simple” Example - 2 • Line by Line – Line 5 uses malloc to allocate the buffer and assigns its address to cldevs – Line 6 calls clGetContextInfo again • The third param is set to parmsz • The fourth param is set to cldevs • The error code is returned – Line 7 creates a command queue for the first device • cldevs is treated as an array and the 2nd param is cldevs[0] • This generates a command queue for the first device in the list returned by clGetContextInfo

Electrostatic Potential Map in OpenCL • Step 1: design the organization of NDRange – Threads are now work items; blocks are work groups – Each work item calculates up to eight grid points – Each work group has 64 to 256 work items

Mapping DCS NDRange to OpenCL Device • The structure is the same; only the nomenclature is changed

Changes in Data Access Indexing • The changes are relatively minor – __global__ becomes __kernel – Accesses to the .x and .y members, and the associated index arithmetic, are replaced by function calls that specify dimension 0 or 1

The Inner Loop of the DCS kernel • The OpenCL code is shown – The logic is basically the same – _rsqrt() has been changed to native_rsqrt()

Building an OpenCL kernel • Line by Line – Line 1 – declares the entire DCS kernel as a string – Line 3 – delivers the source code string to the OpenCL runtime system – Line 4 – sets up the compiler flags – Line 5 – invokes the runtime compiler to build the program – Line 6 – creates a handle to the kernel that can be submitted to a command queue

Host Code for the kernel Launch - 1

Host Code for the kernel Launch - 2 • Line by Line – Lines 1 & 2 : allocate memory for the energy grid and the atoms – Lines 3 – 6 : set up the arguments to be passed to the kernel – Line 8 : submits the DCS kernel for launch – Lines 9-10 : check for errors, if any – Line 11 : transfers the result data in the energy array back to host memory – Lines 12-13 : release memory
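Because the OpenCL runtime compiles kernels from source strings, a kernel like the vector add described under Kernel Functions can be kept in an ordinary C string on the host. The sketch below is a minimal illustration under stated assumptions: the kernel name `vadd`, the variable `vadd_src`, and the helper `vadd_reference` are hypothetical names chosen here, and the helper is a plain-C emulation of a 1D launch (one loop iteration per work item), not an OpenCL call.

```c
#include <assert.h>
#include <stddef.h>

/* OpenCL C source for a vector add kernel, held as a host-side string.
   The kernel name "vadd" is an assumption made for this sketch. */
const char *vadd_src =
    "__kernel void vadd(__global const float *a,\n"
    "                   __global const float *b,\n"
    "                   __global float *result)\n"
    "{\n"
    "    int id = get_global_id(0);\n"
    "    result[id] = a[id] + b[id];\n"
    "}\n";

/* Hypothetical plain-C emulation of a 1D launch of the kernel above:
   each loop iteration plays the role of one work item, and the loop
   counter plays the role of get_global_id(0). */
void vadd_reference(const float *a, const float *b, float *result,
                    size_t global_size)
{
    assert(a != NULL && b != NULL && result != NULL);
    for (size_t id = 0; id < global_size; ++id) {
        result[id] = a[id] + b[id];
    }
}
```

A host-side reference like this can be handy for checking device results: run the emulation on the same inputs and compare its output with the data read back from the device.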