Introduction to OpenCL
Ezio Bartocci, Vienna University of Technology

Overview
• Overview of OpenCL for NVIDIA GPUs
• API and Languages
• Sample codes walkthrough
• OpenCL Information and Resources

OpenCL – Open Computing Language
• OpenCL is an open, royalty-free standard built around a C-based language
• It is a framework designed for parallel programming of heterogeneous systems using GPUs, CPUs, FPGAs, DSPs, and other processors, including those in embedded and mobile devices
• It was initially introduced by Apple and is now supported by NVIDIA, Intel, AMD, IBM, and others in the OpenCL working group
• Managed by the Khronos Group

OpenCL versions and history (1)

OpenCL 1.0 (2008)
• OpenCL 1.0 was released with Mac OS X Snow Leopard
OpenCL 1.1 (2010)
• The Khronos Group added significant functionality for enhanced parallel programming flexibility, functionality, and performance, including:
• New data types including 3-component vectors and additional image formats;
• Handling commands from multiple host threads and processing buffers across multiple devices;
• Operations on regions of a buffer, including read, write, and copy of 1D, 2D, or 3D rectangular regions;
• Enhanced use of events to drive and control command execution;
• Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;
• Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

OpenCL versions and history (2)

OpenCL 1.2 (2011)
• Most notable features include:
• Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks.
• Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.
• Enhanced image support: 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow for OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.
• Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of underlying hardware. Examples include video encoding/decoding and digital signal processors.
• DirectX functionality: DX9 media surface sharing allows for efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.

NVIDIA OpenCL Support
Operating systems
• Windows (XP, Vista, 8), 32/64-bit
• Linux (Ubuntu, RHEL, etc.), 32/64-bit
• Mac OS X Snow Leopard

IDEs and compilers supported
• GCC for Linux
• Visual Studio for Windows

Drivers and JIT compiler
• These are usually provided with the GPU drivers (e.g., the CUDA drivers)
NVIDIA SDK
• It contains examples of applications, the specification, the programming manual, and the best-practices guide

OpenCL Language & API
Platform Layer API (called from the host)
• An abstraction layer for diverse computational resources
• Query, select, and initialize compute devices
• Create compute contexts and work-queues
Runtime API (called from the host)
• Launch compute kernels
• Set kernel execution configuration
• Manage scheduling, compute, and memory resources
OpenCL Language
• Write compute kernels that run on a compute device
• C-based cross-platform programming interface
• Subset of ISO C99 with language extensions
• Includes a rich set of built-in functions
• Can be compiled just-in-time (JIT) or offline

OpenCL Programming Model
NDRange – N-Dimensional Range
• N can be 1, 2, or 3
• It defines the global index space over which each kernel instance executes
Work-item
• A single kernel instance in the index space
• Each work-item executes the same compute kernel, but on different data
• Work-items have unique global IDs within the index space
• It can be related to the concept of a thread in CUDA
Work-group
• Work-items are further grouped into work-groups
• Each work-group has a unique work-group ID
• Work-items have a unique local ID within a work-group
• It can be related to the concept of a block of threads in CUDA

OpenCL Memory Model
Private memory
• Read/write access, for a single work-item only
Local memory
• Read/write access, for an entire work-group (one per compute unit)
Constant memory
• Read access, for the entire ND-range (all work-items, all work-groups)
Global memory
• Read/write access, for the entire ND-range (all work-items, all work-groups); resides in compute-device memory (e.g., on the GPU)

Basic Program Structure
Host program
Platform layer
• Query compute devices
• Create contexts
Runtime
• Create memory objects associated with contexts
• Compile and create kernel program objects
• Issue commands to a command-queue
• Synchronization of commands
• Clean up OpenCL resources
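As a hedged sketch, the host-side steps above map onto the standard OpenCL API roughly as follows (error handling omitted; the kernel name "myKernel" and the buffer size are placeholders, and running this requires an OpenCL implementation and device):

```c
#include <CL/cl.h>

void host_sketch(const char *kernelSource, size_t n, float *hostData)
{
    /* Platform layer: query devices and create a context */
    cl_platform_id platform;
    clGetPlatformIDs(1, &platform, NULL);
    cl_device_id device;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &device, NULL, NULL, NULL);
    cl_command_queue queue = clCreateCommandQueue(ctx, device, 0, NULL);

    /* Runtime: memory objects, program/kernel objects, command submission */
    cl_mem buf = clCreateBuffer(ctx, CL_MEM_READ_WRITE,
                                n * sizeof(float), NULL, NULL);
    cl_program prog = clCreateProgramWithSource(ctx, 1, &kernelSource,
                                                NULL, NULL);
    clBuildProgram(prog, 1, &device, NULL, NULL, NULL);      /* JIT compile */
    cl_kernel kernel = clCreateKernel(prog, "myKernel", NULL);
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);

    size_t global = n;                                        /* 1D NDRange */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global, NULL,
                           0, NULL, NULL);
    clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                        hostData, 0, NULL, NULL);
    clFinish(queue);                                          /* synchronize */

    /* Clean up OpenCL resources */
    clReleaseMemObject(buf);
    clReleaseKernel(kernel);
    clReleaseProgram(prog);
    clReleaseCommandQueue(queue);
    clReleaseContext(ctx);
}
```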
Compute kernel (runs on the device)
• Written in the OpenCL language: C code with some restrictions and extensions
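A minimal compute kernel in the OpenCL language might look like this (a hypothetical vector-add example, not from the slides):

```c
/* Hypothetical example: element-wise vector addition.
 * __kernel marks the entry point; __global marks buffers in global memory. */
__kernel void vecAdd(__global const float *a,
                     __global const float *b,
                     __global float *c,
                     const unsigned int n)
{
    size_t i = get_global_id(0);   /* this work-item's index in the 1D NDRange */
    if (i < n)                     /* guard: NDRange may be rounded up past n */
        c[i] = a[i] + b[i];
}
```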
Basic Program Structure
Buffer objects
• 1D collections of objects (like C arrays)
• Scalar and vector types, and user-defined structures
• Accessed via pointers in the compute kernel

Image objects
• 2D or 3D textures, frame-buffers, or images
• Must be addressed through built-in functions

Sampler objects
• Describe how to sample an image in the kernel
• Addressing modes
• Filtering modes
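On the host, image and sampler objects are created along these lines (a sketch using the OpenCL 1.2 API; the image format, size, and modes are illustrative, and ctx is assumed to be an existing cl_context):

```c
/* Sketch: create a 2D image and a sampler (error checks omitted). */
cl_image_format fmt = { CL_RGBA, CL_UNORM_INT8 };
cl_image_desc desc = { 0 };
desc.image_type   = CL_MEM_OBJECT_IMAGE2D;   /* 2D image */
desc.image_width  = 512;                     /* illustrative size */
desc.image_height = 512;
cl_mem img = clCreateImage(ctx, CL_MEM_READ_ONLY, &fmt, &desc, NULL, NULL);

/* Sampler: how the kernel reads the image (addressing + filtering modes) */
cl_sampler smp = clCreateSampler(ctx,
                                 CL_TRUE,                  /* normalized coords */
                                 CL_ADDRESS_CLAMP_TO_EDGE, /* addressing mode */
                                 CL_FILTER_LINEAR,         /* filtering mode */
                                 NULL);
```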
OpenCL Language Highlights
Function qualifiers
• The __kernel qualifier declares a function as a kernel

Address space qualifiers
• __global, __local, __constant, __private

Work-item functions
• get_work_dim()
• get_global_id(), get_local_id(), get_group_id(), get_local_size()
Image functions
• Images must be accessed through built-in functions
• Reads/writes are performed through sampler objects, either passed from the host or defined in the kernel source
Synchronization functions
• Barriers – all work-items within a work-group must execute the barrier function before any work-item in the work-group can continue
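A typical use of barriers is a work-group reduction in __local memory, where the barrier separates the write phase from the read phase. A hypothetical sketch (assumes the work-group size is a power of two):

```c
/* Hypothetical sketch: sum the inputs of one work-group in local memory.
 * Each work-item writes its value, then all synchronize at the barrier
 * before reading values written by other work-items. */
__kernel void groupSum(__global const float *in,
                       __global float *partialSums,
                       __local float *scratch)
{
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);   /* assumed to be a power of two */
    scratch[lid] = in[get_global_id(0)];

    for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
        barrier(CLK_LOCAL_MEM_FENCE);   /* all writes visible before reads */
        if (lid < stride)
            scratch[lid] += scratch[lid + stride];
    }
    if (lid == 0)
        partialSums[get_group_id(0)] = scratch[0];
}
```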