A Brief Introduction to OpenCL

• Reference: Programming Massively Parallel Processors: A Hands-on Approach, David Kirk and Wen-mei W. Hwu, Chapter 11

What is OpenCL?
• A standardized, cross-platform, parallel-computing API
• Designed to be portable and to work with heterogeneous systems
• Unlike earlier models, such as OpenMP, OpenCL is designed to address complex memory hierarchies and SIMD computing
• A more general standard is also a more complex one: not all devices support all features, and it may be necessary to write adaptable code
• This brief introduction covers the data parallelism model and briefly applies it to the molecular visualization problem

Data Parallelism Model
• There is a direct correspondence with CUDA
• Host programs are used to launch kernels on OpenCL devices
• The index space maps data to the work items
• Work items are grouped into work groups, just as threads are grouped into blocks in CUDA
• Work items in the same work group can be synchronized
• The next slide shows a 2D NDRange (or index space); it is very similar to the CUDA model (except that the work-group indices are in the expected order!)

Parallel Execution Model
• (Figure: a 2D NDRange, with work groups containing work items)

Getting global and local values
• Thread IDs and sizes
  – The API calls and the equivalent CUDA code for dimension 0 (the x dimension) are summarized in a code sketch below, after the Kernel Functions slide
  – If the parameter is 1, it corresponds to the y dimension, and 2 to the z dimension

Device Architecture
• The CPU is a traditional computer that executes the host program
• An OpenCL device:
  – contains one or more compute units (CUs)
  – each CU contains one or more processing elements (PEs)
  – there are a variety of memory types

Memory Characteristics
• Global memory – dynamically allocated by the host; read/write access by both host and devices
• Constant memory – dynamically allocated by the host (unlike CUDA); read/write by the host and read-only by the device; a query returns the maximum size supported by the device
• Local memory – most closely corresponds to CUDA shared memory; can be dynamically allocated by the host and statically allocated by the device; cannot be accessed by the host (same as CUDA) but can be accessed by all work items in the work group
• Private memory – corresponds to CUDA local memory

Kernel Functions
• Similarities with CUDA
  – __kernel corresponds to __global__ in CUDA
  – A vector add kernel is sketched below: two input vectors a and b and one output vector result
  – All three vectors reside in global memory; the two inputs are const
  – This is a 1D problem, so get_global_id(0) is used to get the work-item index
  – The addition is performed as expected
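A minimal sketch of that kernel, following the description above (the kernel name vadd is illustrative; the slides do not reproduce the code here):

    __kernel void vadd(__global const float *a,
                       __global const float *b,
                       __global float *result) {
        int id = get_global_id(0);    // 1D problem: dimension 0 gives the work-item index
        result[id] = a[id] + b[id];   // the addition, performed as expected
    }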
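The kernel above uses get_global_id(0). As the Getting global and local values slide notes, each of these OpenCL calls takes a dimension parameter and corresponds to a CUDA expression; a summary sketch (the variable names are illustrative):

    // Inside a kernel, for dimension 0 (the x dimension):
    size_t gid = get_global_id(0);    // CUDA: blockIdx.x * blockDim.x + threadIdx.x
    size_t lid = get_local_id(0);     // CUDA: threadIdx.x
    size_t gsz = get_global_size(0);  // CUDA: gridDim.x * blockDim.x
    size_t lsz = get_local_size(0);   // CUDA: blockDim.x
    // Passing 1 or 2 instead of 0 queries the y or z dimension.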
Device Management & Kernel Launch
• Now for the "ugly" side of OpenCL
  – CUDA, which deals with a uniform set of devices from one manufacturer, can hide the details of launching kernels
  – This is not possible in OpenCL, which is designed for many widely varied devices from many manufacturers
• An OpenCL context
  – Use clCreateContext()
  – Use clGetDeviceIDs() to find all devices
  – Create a command queue for each device
  – A sequence of function calls is made to insert the kernel code with its execution parameters

A "Simple" Example - 1
• Line by line (the numbered code is reconstructed in the sketch after the next slide)
  – Line 1 sets the error code to success
  – Line 2 calls clCreateContextFromType
    • Include all devices (param 2)
    • The last argument sets the error code
  – Line 3 declares parmsz, the size of the memory buffer
  – Line 4 is the first call to clGetContextInfo
    • clctx from line 2 is the first param
    • Param 4 is NULL since the size is not known
    • There will be another call in line 6 where the missing information is supplied

A "Simple" Example - 2
• Line by line
  – Line 5 uses malloc to assign to cldevs the address of the buffer
  – Line 6 calls clGetContextInfo again
    • The third param is set to parmsz
    • The fourth param is set to cldevs
    • The error code is returned
  – Line 7 creates a command queue for the first device
    • cldevs is treated as an array, and the second param is cldevs[0]
    • This generates a command queue for the first device in the list returned by clGetContextInfo
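The slides describe this code only by line number; the sketch below reconstructs a seven-line sequence consistent with that description (clctx, parmsz, and cldevs are named in the narration above; clcmdq and the exact argument values are assumptions):

    cl_int clerr = CL_SUCCESS;                                       /* Line 1 */
    cl_context clctx = clCreateContextFromType(0,                    /* Line 2 */
                         CL_DEVICE_TYPE_ALL, NULL, NULL, &clerr);
    size_t parmsz;                                                   /* Line 3 */
    clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,              /* Line 4 */
                             0, NULL, &parmsz);
    cl_device_id *cldevs = (cl_device_id *) malloc(parmsz);          /* Line 5 */
    clerr = clGetContextInfo(clctx, CL_CONTEXT_DEVICES,              /* Line 6 */
                             parmsz, cldevs, NULL);
    cl_command_queue clcmdq = clCreateCommandQueue(clctx,            /* Line 7 */
                                cldevs[0], 0, &clerr);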

Electrostatic Potential Map in OpenCL
• Step 1: design the organization of the NDRange
  – Threads are now work items; blocks are work groups
  – Each work item calculates up to eight grid points
  – Each work group has 64 to 256 work items

Mapping DCS NDRange to OpenCL Device
• The structure is the same; only the nomenclature is changed

Changes in Data Access Indexing
• The changes are relatively minor
  – __global__ becomes __kernel
  – Accesses to the .x and .y members, and the associated arithmetic, are replaced by calls specifying dimension 0 or 1

The Inner Loop of the DCS kernel
• The OpenCL code is sketched below
  – The logic is basically the same as the CUDA version
  – The CUDA reciprocal square root call rsqrtf() has been changed to native_rsqrt()
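A sketch of the shape of that inner loop, accumulating one potential value rather than the full eight-point unrolling; it assumes the usual DCS data layout, and none of these identifiers come from the slides:

    /* atominfo: float4 array (.x/.y atom coords, .z pre-squared z distance, .w charge) */
    for (int atomid = 0; atomid < numatoms; atomid++) {
        float dy = coory - atominfo[atomid].y;        // y distance to the atom
        float dyz2 = (dy * dy) + atominfo[atomid].z;  // .z holds the pre-squared z distance
        float dx = coorx - atominfo[atomid].x;        // x distance to the atom
        // native_rsqrt() replaces CUDA's rsqrtf(): a fast 1/sqrt(x)
        energyval += atominfo[atomid].w * native_rsqrt(dx * dx + dyz2);
    }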

Building an OpenCL kernel
• Line by line (a sketch follows)
  – Line 1 declares the entire DCS kernel as a string
  – Line 3 delivers the source code string to the OpenCL runtime system
  – Line 4 sets up the compiler flags
  – Line 5 invokes the runtime compiler to build the program
  – Line 6 obtains a handle to the kernel that can be submitted to a command queue
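A reconstruction consistent with those line numbers (the program variable, the flag string, and the kernel name clenergy are assumptions):

    const char *clenergysrc =                                        /* Line 1 */
        "__kernel void clenergy(...) { ... }";  /* DCS kernel source, abbreviated here */
    cl_program clpgm;                                                /* Line 2 */
    clpgm = clCreateProgramWithSource(clctx, 1, &clenergysrc,        /* Line 3 */
                                      NULL, &clerr);
    char clcompileflags[] = "-cl-mad-enable";                        /* Line 4 */
    clerr = clBuildProgram(clpgm, 0, NULL, clcompileflags,           /* Line 5 */
                           NULL, NULL);
    cl_kernel clkern = clCreateKernel(clpgm, "clenergy", &clerr);    /* Line 6 */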

Host Code for the kernel Launch - 1 & 2
• Line by line (a sketch follows)
  – Lines 1 & 2 allocate device memory for the energy grid and the atoms
  – Lines 3-6 set up the arguments to be passed to the kernel
  – Line 8 submits the DCS kernel for launch
  – Lines 9-10 check for errors, if any
  – Line 11 transfers the result data in the energy array back to host memory
  – Lines 12-13 release the memory
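A sketch of a launch sequence matching those line numbers; it continues the sketches above (clctx, clcmdq, clkern), and the buffer names, sizes, argument order, and work-group geometry are all assumptions:

    cl_mem devenergy = clCreateBuffer(clctx, CL_MEM_READ_WRITE,      /* Line 1 */
                                      volmemsz, NULL, NULL);
    cl_mem devatominfo = clCreateBuffer(clctx, CL_MEM_READ_ONLY,     /* Line 2 */
                                        atommemsz, NULL, NULL);
    clerr = clSetKernelArg(clkern, 0, sizeof(int), &numatoms);       /* Line 3 */
    clerr = clSetKernelArg(clkern, 1, sizeof(float), &gridspacing);  /* Line 4 */
    clerr = clSetKernelArg(clkern, 2, sizeof(cl_mem), &devenergy);   /* Line 5 */
    clerr = clSetKernelArg(clkern, 3, sizeof(cl_mem), &devatominfo); /* Line 6 */
    size_t Gsz[2] = {gridx, gridy}, Bsz[2] = {8, 8};                 /* Line 7 */
    /* Note: Gsz is the total number of work items (not groups, unlike CUDA) */
    clerr = clEnqueueNDRangeKernel(clcmdq, clkern, 2, NULL,          /* Line 8 */
                                   Gsz, Bsz, 0, NULL, NULL);
    if (clerr != CL_SUCCESS)                                         /* Line 9 */
        return -1;                     /* abort on launch error */   /* Line 10 */
    clerr = clEnqueueReadBuffer(clcmdq, devenergy, CL_TRUE, 0,       /* Line 11 */
                                volmemsz, energy, 0, NULL, NULL);
    clReleaseMemObject(devenergy);                                   /* Line 12 */
    clReleaseMemObject(devatominfo);                                 /* Line 13 */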