OpenCL Programming

OpenCL programming
T-106.6200 Special course High-Performance Embedded Computing (HPEC)
Autumn 2013 (I-II), Lecture 8
Vesa Hirvisalo
2013-11-05 V. Hirvisalo ESG/CSE/Aalto

Today
• Introduction to OpenCL
  – Nature and history of the language
• OpenCL programming
  – Programming model
  – Program structure
• A short example
  – More in the exercises
• Properties
  – Kernel language restrictions

Introduction to OpenCL

OpenCL
• Low-level language for high-performance heterogeneous data-parallel computation
  – An open standard, mostly API-based
• Access to all compute devices in your system
  – CPUs
  – GPUs
  – Other accelerators
• Based on C99
  – A well-known language
  – A low-level language
• Portable across devices (but portability is not trivial)
  – Vector intrinsics and math libraries
  – Guaranteed precision for operations

OpenCL vs CUDA
• Different background
  – CUDA has better tools, language, and features
  – OpenCL supports more devices
• Different vendor connections
  – OpenCL: Apple etc. (but basically not vendor-specific)
  – CUDA: NVIDIA (vendor-specific)
• Basically the same
  – If you know the basics of one, you know the other
  – OpenCL is not HW-specific
    • More queries, settings, …, more verbose (more painful to code)
    • You are by default closer to the driver (you understand more)
  – Both reflect GPU architectures of approx. 2009
    • The world is changing…

Applicability of OpenCL
• Anything that is
  – Computationally intensive
    • Low intensity: many memory accesses, few (and cheap) computations, e.g., A[i] = B[j]
    • High intensity: costly computations, few memory accesses, e.g., A[i] = exp(temp)*acos(temp)
  – Data-parallel
    • Full independence between data items is not required
    • Can cope with limited dependencies
      – E.g., streaming computations, stencil computations, …
  – Floating-point computations ((16)/32/64 bit, IEEE 754)
    • Can do integers, too
• Note: this is because OpenCL was designed for GPUs, and GPUs are good at these things

OpenCL history
• OpenCL 1.0
  – Initial version (Apple)
• OpenCL 1.1 (we will use this in exercises)
  – New data types, new region commands, …
• OpenCL 1.2
  – Image support, built-in kernels, DirectX functionality, …
• OpenCL 2.0 (not really here, yet…)
  – Shared virtual memory
  – Dynamic parallelism
  – Generic address space
  – C++11 atomics, pipes
  – Android installable client driver extension

OpenCL programming

OpenCL language (API)
• Platform Layer API (called from the host)
  – Abstracts different devices (get the system set-up)
  – Query, select and initialize compute devices
  – Create compute contexts and work-queues
• Runtime API (called from the host)
  – Run the GPU code (launch compute kernels)
  – Set kernel execution configuration
  – Manage resources: scheduling, compute, and memory
• OpenCL Language (the kernel code)
  – Subset of ISO C99 with language extensions
    • Basically, the GPU limitations rule many features out
    • Must be able to compile and link
    • Suitable for JIT (just-in-time) compilation or offline compilation
  – Includes a rich set of built-in functions

OpenCL programming model
[Figure: OpenCL programming model. Source: Khronos]

OpenCL programming model
• NDRange
  – N-Dimensional Range
  – N can be 1, 2 or 3
  – Defines the global index space of the kernel instances
• Work-item
  – A single kernel instance in the index space
  – Each work-item executes the same kernel code
  – Work-items have unique global IDs from the index space
  – Comparable to a thread in CUDA
  – Work-items are further grouped into work-groups
• Work-group
  – Work-items have a unique local ID within a work-group
  – Comparable to a block of threads in CUDA

OpenCL memory model
[Figure: OpenCL memory model. Source: Khronos]

OpenCL memory model
• Private Memory
  – Read/write access
  – For a single work-item only
• Local Memory
  – Read/write access
  – For an entire work-group
• Constant Memory
  – Read access
  – For the entire NDRange
  – All work-items, all work-groups
• Global Memory
  – Read/write access
  – For the entire NDRange
  – All work-items, all work-groups

OpenCL program flow
• Host program
  – Query compute devices
  – Create contexts
  – Create memory objects associated with contexts
  – Compile and create kernel program objects
  – Issue commands to the command-queue
  – Synchronize commands
  – Clean up OpenCL resources
• Compute kernel (runs on the device)
  – C code doing the computation

OpenCL objects
• Buffer objects
  – 1D collection of objects (like C arrays)
  – Scalar and vector types, and user-defined structures
  – Accessed via pointers in the compute kernel
• Image objects
  – 2D or 3D textures, frame-buffers, or images
  – Must be addressed through built-in functions
• Sampler objects
  – Describe how to sample an image in the kernel
  – Addressing modes
  – Filtering modes

An example

Example (FFT): Host code (1/2)

    // create a compute context with GPU device
    context = clCreateContextFromType(NULL, CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

    // create a command queue
    clGetDeviceIDs(NULL, CL_DEVICE_TYPE_DEFAULT, 1, &device_id, NULL);
    queue = clCreateCommandQueue(context, device_id, 0, NULL);

    // allocate the buffer memory objects
    memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                sizeof(float)*2*num_entries, srcA, NULL);
    memobjs[1] = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                sizeof(float)*2*num_entries, NULL, NULL);

    // create the compute program
    program = clCreateProgramWithSource(context, 1, &fft1D_1024_kernel_src, NULL, NULL);

    // build the compute program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

Example (FFT): Host code (2/2)

    // create the compute kernel
    kernel = clCreateKernel(program, "fft1D_1024", NULL);

    // set the work-item dimensions first: local_work_size[0] is needed
    // below to size the two local-memory arguments
    global_work_size[0] = num_entries;
    local_work_size[0] = 64; // Nvidia: 192 or 256

    // set the args values
    clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *)&memobjs[0]);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *)&memobjs[1]);
    clSetKernelArg(kernel, 2, sizeof(float)*(local_work_size[0]+1)*16, NULL);
    clSetKernelArg(kernel, 3, sizeof(float)*(local_work_size[0]+1)*16, NULL);

    // execute the kernel over the 1-D range
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, global_work_size,
                           local_work_size, 0, NULL, NULL);

Example (FFT): Kernel code (1/2)

    __kernel void fft1D_1024(__global float2 *in, __global float2 *out,
                             __local float *sMemx, __local float *sMemy)
    {
        int tid = get_local_id(0);                    // local work-item ID
        int blockIdx = get_group_id(0) * 1024 + tid;  // add to global ID
        float2 data[16];  // 1024 size FFT based on radix 16 (and then 4)

        // starting index of data to/from global memory
        in = in + blockIdx;
        out = out + blockIdx;

        globalLoads(data, in, 64);             // coalesced global reads
        fftRadix16Pass(data);                  // in-place radix-16 pass
        twiddleFactorMul(data, tid, 1024, 0);

Example (FFT): Kernel code (2/2)

        // local shuffle using local memory
        localShuffle(data, sMemx, sMemy, tid, (((tid & 15) * 65) + (tid >> 4)));
        fftRadix16Pass(data);                  // in-place radix-16 pass
        twiddleFactorMul(data, tid, 64, 4);    // twiddle factor multiplication
        localShuffle(data, sMemx, sMemy, tid, (((tid >> 4) * 64) + (tid & 15)));

        // four radix-4 function calls
        fftRadix4Pass(data);       // radix-4 function number 1
        fftRadix4Pass(data + 4);   // radix-4 function number 2
        fftRadix4Pass(data + 8);   // radix-4 function number 3
        fftRadix4Pass(data + 12);  // radix-4 function number 4

        // coalesced global writes
        globalStores(data, out, 64);
    }

OpenCL properties

Kernel code restrictions (1/3)
• Kernel code has the following restrictions (some have been or are going to be relaxed)
  – Pointer arguments must use the global, constant, or local qualifier
  – An argument cannot be a pointer to a pointer
  – Arguments to kernel functions cannot be declared with:
    • bool, half, size_t, ptrdiff_t, intptr_t, uintptr_t, or event_t
  – The return type for a kernel function must be void
  – Struct arguments cannot pass OpenCL objects as fields of the struct
    • E.g., buffers, images
  – Bit-field struct members are not supported
  – No variable-length arrays and no structures with flexible (or unsized) arrays
  – Variadic macros and functions are not supported
  – extern, static, auto, and register storage class specifiers are not supported
  – Predefined identifiers such as __func__ are not supported

Kernel code restrictions (2/3)
• And more…
  – Many library functions are disallowed
    • The standard headers assert.h, ctype.h, complex.h, errno.h, fenv.h, float.h, inttypes.h, limits.h, locale.h, setjmp.h, signal.h, stdarg.h, stdio.h, stdlib.h, string.h, tgmath.h, time.h, wchar.h, and wctype.h are not available
  – Restricted use of many types
    • image2d_t and image3d_t
      – Only as arg, and not to be modified, etc.
    • sampler_t
      – Only as arg or top-level, etc.
    • event_t
      – Not as a kernel arg, etc.

Kernel code restrictions (3/3)
• Restrictions tell a lot about the implementation
  – i.e., the underlying device HW & SW support
• Especially
  – You cannot launch kernels within kernels
    • Relaxed if HW allows
  – Recursion is not allowed
  – The behavior of irreducible control flow in a kernel is implementation-defined, e.g.
    • goto jumping inside a nested loop
    • Fall-through-style switch structures
      – Duff's device

References
• Khronos OpenCL tutorials, reference card, etc.:
  – http://www.khronos.org/opencl
  – http://www.khronos.org/registry/cl/specs/opencl-1.1.pdf
    • To be used in the practical exercises
  – http://www.khronos.org/registry/cl/specs/opencl-2.0-openclc.pdf
    • The newest (provisional) specification
• AMD OpenCL Resources
  – http://developer.amd.com/opencl
• NVIDIA OpenCL Resources
  – http://developer.nvidia.com/opencl
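
Supplementary sketch: NDRange, work-groups, and work-items
The relation between the NDRange, work-groups, and work-items described in the programming model slides can be illustrated in plain C, without an OpenCL installation. This is only a sketch under stated assumptions: work_item_ids, vec_add_item, and run_ndrange_1d are hypothetical names that do not come from the slides or from the OpenCL API, the range is 1-D with a zero global offset, and the work-items run one after another in a loop, whereas a real device would run them in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for the IDs a work-item can query in a kernel. */
typedef struct {
    size_t global_id;  /* corresponds to get_global_id(0) */
    size_t local_id;   /* corresponds to get_local_id(0)  */
    size_t group_id;   /* corresponds to get_group_id(0)  */
} work_item_ids;

/* Hypothetical "kernel" body: each work-item handles exactly one element. */
void vec_add_item(work_item_ids ids, const float *a, const float *b, float *out)
{
    out[ids.global_id] = a[ids.global_id] + b[ids.global_id];
}

/* Sequential emulation of a 1-D NDRange launch.
 * global_size: total number of work-items (the index space)
 * local_size:  number of work-items per work-group */
void run_ndrange_1d(size_t global_size, size_t local_size,
                    const float *a, const float *b, float *out)
{
    /* OpenCL 1.x requires the global size to be a multiple of the local size. */
    assert(local_size > 0 && global_size % local_size == 0);

    size_t num_groups = global_size / local_size;
    for (size_t g = 0; g < num_groups; ++g) {        /* one pass per work-group */
        for (size_t l = 0; l < local_size; ++l) {    /* one pass per work-item  */
            /* With a zero offset: global ID = group ID * local size + local ID */
            work_item_ids ids = { g * local_size + l, l, g };
            vec_add_item(ids, a, b, out);
        }
    }
}
```

Calling run_ndrange_1d(8, 4, a, b, out) emulates two work-groups of four work-items each. The work-item with local ID 1 in group 1 gets global ID 5 and writes out[5].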

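Supplementary sketch: irreducible control flow
The restriction slides name Duff's device as an example of control flow whose behavior in a kernel is implementation-defined. The plain C sketch below shows what the construct looks like next to a simple loop that does the same work. The names copy_duff and copy_plain are hypothetical and chosen for illustration. This is host-style C, not kernel code, and it is not taken from the slides.

```c
#include <stddef.h>

/* Duff's device: the switch jumps into the middle of the do-while loop,
 * so the loop body has several entry points (irreducible control flow). */
void copy_duff(float *dst, const float *src, size_t count)
{
    if (count == 0)
        return;                          /* the classic form mishandles 0 */
    size_t passes = (count + 3) / 4;     /* number of (partial) loop passes */
    switch (count % 4) {
    case 0: do { *dst++ = *src++;        /* fall through */
    case 3:      *dst++ = *src++;        /* fall through */
    case 2:      *dst++ = *src++;        /* fall through */
    case 1:      *dst++ = *src++;
            } while (--passes > 0);
    }
}

/* Same result with a single-entry loop, which avoids the restriction. */
void copy_plain(float *dst, const float *src, size_t count)
{
    for (size_t i = 0; i < count; ++i)
        dst[i] = src[i];
}
```

Both functions copy the same elements. In a kernel, the single-entry loop is the portable choice, and any unrolling can be left to the device compiler.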