OpenCL Programming

OpenCL programming
T-106.6200 Special course High-Performance Embedded Computing (HPEC)
Autumn 2013 (I-II), Lecture 8
Vesa Hirvisalo
2013-11-05 V. Hirvisalo ESG/CSE/Aalto

Today
• Introduction to OpenCL
  – Nature and history of the language
• OpenCL programming
  – Programming model
  – Program structure
• A short example
  – More in the exercises
• Properties
  – Kernel language restrictions

Introduction to OpenCL

OpenCL
• Low-level language for high-performance heterogeneous data-parallel computation
  – An open standard, mostly API-based
• Access to all compute devices in your system
  – CPUs
  – GPUs
  – Other accelerators
• Based on C99
  – A well-known language
  – A low-level language
• Portable across devices (but not trivially)
  – Vector intrinsics and math libraries
  – Guaranteed precision for operations

OpenCL vs CUDA
• Different background
  – CUDA has better tools, language, and features
  – OpenCL supports more devices
• Different vendor connections
  – OpenCL: Apple etc. (but basically not vendor specific)
  – CUDA: NVIDIA (is vendor specific)
• Basically the same
  – If you know the basics of one, you know the other
  – OpenCL is not HW specific
    • More queries, settings, …, more verbose (more painful to code)
    • You are by default closer to the driver (you understand more)
  – Both reflect GPU architectures of approx. 2009
    • The world is changing…

Applicability of OpenCL
• Anything that is
  – Computationally intensive
    • Low intensity: a lot of memory accesses, few (and cheap) computations, e.g., A[i] = B[j]
    • High intensity: costly computations, few memory accesses, e.g., A[i] = exp(temp)*acos(temp)
  – Data-parallel
    • Full independence between data items is not required
    • Can cope with limited dependencies
      – E.g., streaming computations, stencil computations, …
  – Floating-point computations ((16)/32/64 bit, IEEE 754)
    • Can do integers, too
• Note: this is because OpenCL was designed for GPUs, and GPUs are good at these things

OpenCL history
• OpenCL 1.0
  – Initial version (Apple)
• OpenCL 1.1 (we will use this in exercises)
  – New data types, new region commands, …
• OpenCL 1.2
  – Image support, built-in kernels, DirectX functionality, …
• OpenCL 2.0 (not really here, yet…)
  – Shared virtual memory
  – Dynamic parallelism
  – Generic address space
  – C++11 atomics, pipes
  – Android installable client driver extension

OpenCL programming

OpenCL language (API)
• Platform Layer API (called from the host)
  – Abstracts different devices (get the system set-up)
  – Query, select and initialize compute devices
  – Create compute contexts and work-queues
• Runtime API (called from the host)
  – Run the GPU code (launch compute kernels)
  – Set kernel execution configuration
  – Manage resources: scheduling, compute, and memory
• OpenCL Language (the kernel code)
  – Subset of ISO C99 with language extensions
    • Basically, the GPU limitations rule many features out
    • Must be able to compile and link
    • Suitable for JIT (Just In Time) compilation or offline compilation
  – Includes a rich set of built-in functions

OpenCL programming model
(figure; source: Khronos)

OpenCL programming model
• NDRange
  – N-Dimensional Range
  – N can be 1, 2 or 3
  – Defines the global index space for each kernel instance
• Work-item
  – A single kernel instance in the index space
  – Each work-item executes the same computation
  – Work-items have unique global IDs from the index space
  – It can be related to the concept of a Thread in CUDA
  – Work-items are further grouped into work-groups
• Work-group
  – Work-items have a unique local ID within a work-group
  – It can be related to the concept of a Block of Threads in CUDA

OpenCL memory model
(figure; source: Khronos)

OpenCL memory model
• Private Memory
  – Read/write access
  – For a work-item only
• Local Memory
  – Read/write access
  – For an entire work-group
• Constant Memory
  – Read access
  – For the entire NDRange
  – All work-items, all work-groups
• Global Memory
  – Read/write access
  – For the entire NDRange
  – All work-items, all work-groups

OpenCL program flow
• Host program
  – Query compute devices
  – Create contexts
  – Create memory objects associated to contexts
  – Compile and create kernel program objects
  – Issue commands to command-queue
  – Synchronization of commands
  – Clean up OpenCL resources
• Compute Kernel (runs on device)
  – C code doing the computation

OpenCL objects
• Buffer objects
  – 1D collection of objects (like C arrays)
  – Scalar & vector types, and user-defined structures
  – Accessed via pointers in the compute kernel
• Image objects
  – 2D or 3D texture, frame-buffer, or images
  – Must be addressed through built-in functions
• Sampler objects
  – Describe how to sample an image in the kernel
  – Addressing modes
  – Filtering modes

An example
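Before the FFT code, the index relations of the programming model (NDRange, work-item, work-group) can be sketched in plain C, outside OpenCL. This is a minimal sketch for a 1-D NDRange, assuming no global offset; the type and function names (`work_item_ids`, `ids_from_global`, `global_from_ids`) are hypothetical and only mirror what the built-ins `get_global_id`, `get_group_id`, and `get_local_id` would report inside a kernel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper type: the three IDs one work-item sees in a 1-D NDRange. */
typedef struct {
    size_t global_id; /* position in the whole index space */
    size_t group_id;  /* which work-group the item belongs to */
    size_t local_id;  /* position inside that work-group */
} work_item_ids;

/* Decompose a global ID into group and local IDs (assumes zero global offset). */
work_item_ids ids_from_global(size_t global_id, size_t local_size)
{
    work_item_ids ids;
    assert(local_size > 0);
    ids.global_id = global_id;
    ids.group_id = global_id / local_size;
    ids.local_id = global_id % local_size;
    return ids;
}

/* Inverse relation: global = group * local_size + local. */
size_t global_from_ids(size_t group_id, size_t local_id, size_t local_size)
{
    return group_id * local_size + local_id;
}
```

For example, with a work-group size of 64, global ID 130 falls in work-group 2 at local ID 2.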
Example (FFT): Host code (1/2)

    // create a compute context with GPU device
    context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // create a command queue
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_DEFAULT, 1, &device_id, NULL);
    queue = clCreateCommandQueue(context, device_id, 0, NULL);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float)*2*num_entries, srcA, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                sizeof(float)*2*num_entries, NULL, NULL);

    // create the compute program
    program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);

    // build the compute program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

Example (FFT): Host code (2/2)

    // create the compute kernel
    kernel = clCreateKernel(program, "fft1D_1024", NULL);

    // choose the work-item dimensions first: the local memory sizes below
    // depend on local_work_size
    global_work_size[0] = num_entries;
    local_work_size[0] = 64; // Nvidia: 192 or 256

    // set the args values
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
    clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

    // execute the kernel over the 1-D range
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size,
                           local_work_size, 0, NULL, NULL);
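The launch arithmetic in the host code can be checked without an OpenCL runtime. The following plain C sketch assumes the sizes used in the host code (one dimension, 64 work-items per work-group, and `sizeof(float)*(local_work_size+1)*16` bytes for each `__local` argument); the function names `num_work_groups` and `local_buffer_bytes` are hypothetical and not part of the OpenCL API.

```c
#include <assert.h>
#include <stddef.h>

/* Number of work-groups in a 1-D launch. In OpenCL 1.x the global size
   must be a multiple of the local size when a local size is given. */
size_t num_work_groups(size_t global_work_size, size_t local_work_size)
{
    assert(local_work_size > 0);
    assert(global_work_size % local_work_size == 0);
    return global_work_size / local_work_size;
}

/* Bytes requested for each __local float buffer, using the same
   expression as the clSetKernelArg calls for arguments 2 and 3. */
size_t local_buffer_bytes(size_t local_work_size)
{
    return sizeof(float) * (local_work_size + 1) * 16;
}
```

With `local_work_size[0] = 64`, each local buffer holds 65 * 16 = 1040 floats, and a global size of 1024 gives 16 work-groups.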
Example (FFT): Kernel code (1/2)

    __kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                             __local float *sMemx, __local float *sMemy)
    {
        int tid = get_local_id(0);                    // local work-item ID
        int blockIdx = get_group_id(0) * 1024 + tid;  // add to global ID
        float2 data[16];  // 1024 size FFT based on radix 16 (and then 4)

        // starting index of data to/from global memory
        in = in + blockIdx;
        out = out + blockIdx;

        globalLoads(data, in, 64);             // coalesced global reads
        fftRadix16Pass(data);                  // in-place radix-16 pass
        twiddleFactorMul(data, tid, 1024, 0);

Example (FFT): Kernel code (2/2)

        // local shuffle using local memory
        localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
        fftRadix16Pass(data);                  // in-place radix-16 pass
        twiddleFactorMul(data, tid, 64, 4);    // twiddle factor multiplication
        localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

        // four radix-4 function calls
        fftRadix4Pass(data);       // radix-4 function number 1
        fftRadix4Pass(data + 4);   // radix-4 function number 2
        fftRadix4Pass(data + 8);   // radix-4 function number 3
        fftRadix4Pass(data + 12);  // radix-4 function number 4

        // coalesced global writes
        globalStores(data, out, 64);
    }

OpenCL properties

Kernel code restrictions (1/3)
• Kernel code has the following restrictions (some are or are going to be relaxed)
  – Pointer arguments must use the global, constant, or local qualifier
  – An argument cannot be a pointer to a pointer
  – Arguments to kernel functions cannot be declared with:
    • bool, half, size_t, ptrdiff_t, intptr_t, uintptr_t, or event_t
  – The return type for a kernel function must be void
  – Struct arguments cannot pass OpenCL objects as fields of the struct
    • E.g., buffers, images
  – Bit-field struct members are not supported
  – No variable-length arrays and no structures with flexible (or unsized) arrays
  – Variadic macros and functions are not supported
  – extern, static, auto, and register storage class specifiers are not supported
  – Predefined identifiers such as __func__ are not supported

Kernel code restrictions (2/3)
• And more…
  – Many library functions are disallowed
    • Headers: assert.h, ctype.h, complex.h, errno.h, fenv.h, float.h, inttypes.h, limits.h, locale.h, setjmp.h, signal.h, stdarg.h, stdio.h, stdlib.h, string.h, tgmath.h, time.h, wchar.h, and wctype.h
  – Restricted use of many types
    • image2d_t and image3d_t
      – Only as arg, and not to be modified, etc.
    • sampler_t
      – Only as arg or top-level, etc.
    • event_t
      – Not as a kernel arg, etc.

Kernel code restrictions (3/3)
• Restrictions tell a lot about the implementation
  – i.e., the underlying device HW & SW support
• Especially
  – You cannot launch kernels within kernels
    • Relaxed if HW allows
  – Recursion is not allowed
  – The behavior of irreducible control flow in a kernel is implementation-defined, e.g.
    • goto jumping inside a nested loop
    • Fall-through-style switch structures
      – Duff's device

References
• Khronos OpenCL tutorials, reference card, etc.:
  – http://www.khronos.org/opencl
  – http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
    • To be used in the practical exercises
  – http://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
    • The newest (provisional) specification
• AMD OpenCL Resources
  – http://developer.amd.com/opencl
• NVIDIA OpenCL Resources
  – http://developer.nvidia.com/opencl
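
Returning to the recursion restriction listed under Kernel code restrictions (3/3): recursive algorithms are typically rewritten as loops before they can go into a kernel. As an illustration only (it is not taken from the lecture), the following plain C sketch sums an array with an iterative pairwise reduction instead of a recursive divide-and-conquer. The function name `reduce_sum_inplace` is hypothetical, and the loops run sequentially here, whereas in a kernel the inner loop would be spread over work-items.

```c
#include <stddef.h>

/* Iterative pairwise (tree) reduction; overwrites data and returns the sum.
   It replaces the recursive form sum(lo, hi) = sum(lo, mid) + sum(mid, hi). */
float reduce_sum_inplace(float *data, size_t n)
{
    size_t stride, i;
    if (n == 0) {
        return 0.0f;
    }
    for (stride = 1; stride < n; stride *= 2) {
        /* Each step folds in the partial sum 'stride' elements to the right. */
        for (i = 0; i + stride < n; i += 2 * stride) {
            data[i] += data[i + stride];
        }
    }
    return data[0];
}
```

The input length does not have to be a power of two: elements without a partner at a given stride are simply carried to a later step.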
