Introduction to OpenCL

Ezio Bartocci, Vienna University of Technology

Overview

• Overview of OpenCL for GPUs

• API and Languages

• Sample codes walkthrough

• OpenCL Information and Resources

OpenCL – Open Language

• OpenCL is an open, royalty-free, C-based language extension

• It is a framework designed for parallel programming of heterogeneous systems that use GPUs, CPUs, FPGAs, DSPs and other processors, including embedded and mobile devices

• It was initially introduced by Apple and is now supported by NVIDIA, AMD, IBM, … (the members of the OpenCL working group)

• Managed by the Khronos Group

OpenCL versions and history (1)

OpenCL 1.0 (2008) • OpenCL 1.0 was released with Mac OS X Snow Leopard

OpenCL 1.1 (2010) • The Khronos Group added significant functionality for enhanced parallel programming flexibility, functionality, and performance, including:

• New data types including 3-component vectors and additional image formats;

• Handling commands from multiple host threads and processing buffers across multiple devices;

• Operations on regions of a buffer, including read, write and copy of 1D, 2D, or 3D rectangular regions; • Enhanced use of events to drive and control command execution;

• Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies;

• Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events.

OpenCL versions and history (2)

OpenCL 1.2 (2011) • Most notable features include:

• Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks (a minimal sketch follows this feature list).

• Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.

• Enhanced image support: 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow for OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.

• Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of underlying hardware. Examples include video encoding/decoding and digital signal processors.

• DirectX functionality: DX9 media surface sharing allows for efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.
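As a rough illustration of the device-partitioning feature above, the sketch below splits one device into sub-devices with an equal number of compute units each, using the OpenCL 1.2 call clCreateSubDevices. Variable names are illustrative, error handling is minimal, and not every device supports partitioning.

#include <CL/cl.h>

/* Illustrative sketch only: partition a device (obtained earlier with
   clGetDeviceIDs) into sub-devices of 4 compute units each. */
static void partition_device(cl_device_id device)
{
    /* Zero-terminated property list: "split equally, 4 CUs per sub-device". */
    const cl_device_partition_property props[] = { CL_DEVICE_PARTITION_EQUALLY, 4, 0 };

    cl_device_id sub_devices[8];
    cl_uint num_sub = 0;

    if (clCreateSubDevices(device, props, 8, sub_devices, &num_sub) == CL_SUCCESS) {
        /* Each sub-device can now get its own context and command queue,
           e.g. to reserve compute units for a time-critical task. */
    }
}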

NVIDIA OpenCL Support

Operating Systems • Windows (XP, VISTA, 8) 32/64 bits • Linux (Ubuntu, RHEL, etc.) 32/64 bits • Mac OS X Snow Leopard

IDEs supported • GCC for Linux • Visual Studio for Windows

Drivers and JIT Compiler • They are usually provided with the GPU drivers (e.g. the CUDA drivers)

NVIDIA SDK • It contains examples of applications, the specification, the programming manual and the best practices guide.

OpenCL Language & API

Platform Layer API (called from the host) • It is an abstraction layer for diverse computational resources • Query, select and initialize compute devices • Create compute contexts and work-queues
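A minimal sketch of these platform-layer calls, assuming a single platform with at least one GPU device; error handling is omitted and the function name is illustrative.

#include <CL/cl.h>
#include <stddef.h>

/* Query one platform and one GPU device, then create a context and a
   command queue for it (OpenCL 1.x style). */
static cl_command_queue setup_queue(cl_context *ctx_out, cl_device_id *dev_out)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_int         err;

    clGetPlatformIDs(1, &platform, NULL);                           /* first available platform */
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, NULL); /* first GPU on it          */

    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);

    /* clCreateCommandQueue is the OpenCL 1.x entry point; OpenCL 2.0
       replaces it with clCreateCommandQueueWithProperties. */
    cl_command_queue queue = clCreateCommandQueue(context, device, 0, &err);

    *ctx_out = context;
    *dev_out = device;
    return queue;
}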

Runtime API (called from the host) • Launch compute kernels • Set kernel execution configuration • Manage scheduling, compute, and memory resources
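A minimal sketch of the runtime-layer calls, assuming the queue, kernel and buffers have already been created (the names below are illustrative); it sets the kernel arguments, chooses a 1D execution configuration and launches.

#include <CL/cl.h>

/* Set kernel arguments, choose the NDRange (global and local work sizes),
   launch the kernel, and wait for completion. */
static void launch_kernel(cl_command_queue queue, cl_kernel kernel,
                          cl_mem d_in, cl_mem d_out, cl_uint n)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem),  &d_in);
    clSetKernelArg(kernel, 1, sizeof(cl_mem),  &d_out);
    clSetKernelArg(kernel, 2, sizeof(cl_uint), &n);

    size_t global_size = 1024;   /* total work-items in the NDRange */
    size_t local_size  = 64;     /* work-items per work-group       */

    clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                           &global_size, &local_size, 0, NULL, NULL);
    clFinish(queue);             /* block until all enqueued commands complete */
}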

OpenCL Language • Write compute kernels that run on a compute device • C-based cross-platform programming interface • Subset of ISO C99 with language extensions • Includes a rich set of built-in functions • Can be compiled Just-In-Time (JIT) or offline
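The JIT path can be sketched as follows: the kernel source is handed to the driver as a string and compiled at run time. The kernel name "vec_add" is a hypothetical example (it matches the kernel shown later under the language highlights).

#include <CL/cl.h>

/* Build an OpenCL C program from source at run time (JIT) and extract
   one kernel object from it. Error handling is omitted for brevity. */
static cl_kernel build_kernel(cl_context context, cl_device_id device,
                              const char *source)
{
    cl_int err;
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, &err);

    clBuildProgram(program, 1, &device, NULL, NULL, NULL);  /* JIT compile for this device */

    return clCreateKernel(program, "vec_add", &err);         /* hypothetical kernel name    */
}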

OpenCL Programming Model

NDRange – N-Dimensional Range • N can be 1, 2 or 3 • It defines the global index space for each kernel instance.

OpenCL Programming Model

Work-item • A single kernel instance in the index space • Each work-item executes the same compute kernel but on different data • Work-items have unique global IDs from the index space • It can be related to the concept of a Thread in CUDA

OpenCL Programming Model

Work-group • Work-items are further grouped into work-groups • Each work-group has a unique work-group ID • Work-items have a unique local ID within a work-group • It can be related to the concept of a Block of Threads in CUDA
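The relation between these IDs can be seen in a small, illustrative kernel fragment: per dimension, the global ID equals the work-group ID times the work-group size plus the local ID (when no global offset is used).

__kernel void show_ids(__global int *out)
{
    size_t gid   = get_global_id(0);   /* unique ID in the whole NDRange  */
    size_t lid   = get_local_id(0);    /* unique ID within the work-group */
    size_t group = get_group_id(0);    /* ID of the work-group itself     */
    size_t lsize = get_local_size(0);  /* work-items per work-group       */

    out[gid] = (int)(group * lsize + lid);  /* equals gid when there is no offset */
}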

OpenCL Memory Model

• Private Memory – read/write access, for a single work-item only
• Local Memory – read/write access, for an entire work-group (one per compute unit)
• Constant Memory – read access, for the entire NDRange (all work-items, all work-groups)
• Global Memory – read/write access, for the entire NDRange (all work-items, all work-groups); resides in compute-device memory (e.g. on the GPU)
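These memory regions map directly onto the address-space qualifiers of the kernel language; the illustrative fragment below shows where each one typically appears.

__kernel void spaces(__global   float *data,     /* global: visible to all work-items          */
                     __constant float *coeffs,   /* constant: read-only for the whole NDRange  */
                     __local    float *scratch)  /* local: shared within one work-group        */
{
    float tmp = coeffs[0];                       /* tmp lives in private memory (per work-item) */
    scratch[get_local_id(0)] = data[get_global_id(0)] * tmp;
}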

Basic Program Structure

Host program

PLATFORM LAYER • Query compute devices • Create contexts

RUNTIME • Create memory objects associated to contexts • Compile and create kernel program objects • Issue commands to command-queue • Synchronization of commands • Clean up OpenCL resources
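The final clean-up step typically releases the objects in roughly the reverse order of creation; a minimal sketch with illustrative variable names:

#include <CL/cl.h>

static void cleanup(cl_mem d_in, cl_mem d_out, cl_kernel kernel,
                    cl_program program, cl_command_queue queue, cl_context context)
{
    clReleaseMemObject(d_in);
    clReleaseMemObject(d_out);
    clReleaseKernel(kernel);
    clReleaseProgram(program);
    clReleaseCommandQueue(queue);
    clReleaseContext(context);
}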

Compute Kernel (runs on device) • Written in the OpenCL language: C code with some restrictions and extensions

Basic Program Structure

Buffer objects • 1D collection of objects (like C arrays) • Scalar & vector types, and user-defined structures • They are accessed via pointers in the kernel
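A minimal sketch of creating two buffer objects on the host, assuming a context from the platform layer; the input buffer is initialised from host memory, the output buffer is left for the kernel to fill.

#include <CL/cl.h>

static void create_buffers(cl_context context, const float *host_input,
                           size_t n, cl_mem *d_in, cl_mem *d_out)
{
    cl_int err;
    size_t bytes = n * sizeof(float);

    /* Read-only input buffer, copied from host memory at creation time. */
    *d_in  = clCreateBuffer(context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                            bytes, (void *)host_input, &err);

    /* Write-only output buffer; read back later with clEnqueueReadBuffer(). */
    *d_out = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, &err);
}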

Image objects • 2D or 3D texture, frame-buffer, or images • Must be addressed through built-in functions

Sampler objects • Describe how to sample an image in the kernel • Addressing modes • Filtering modes
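An illustrative kernel fragment showing how an image object is read through a sampler; here the sampler is declared in the kernel source, but it could equally be created on the host with clCreateSampler and passed as an argument.

__constant sampler_t smp = CLK_NORMALIZED_COORDS_FALSE |
                           CLK_ADDRESS_CLAMP_TO_EDGE   |
                           CLK_FILTER_NEAREST;

__kernel void copy_image(__read_only image2d_t src, __global float4 *dst, int width)
{
    int x = get_global_id(0);
    int y = get_global_id(1);

    float4 px = read_imagef(src, smp, (int2)(x, y));  /* images use built-ins, not pointers */
    dst[y * width + x] = px;
}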

OpenCL Language Highlights

Function qualifiers • “__kernel” qualifier declares a function as a kernel

Address space qualifiers • “__global, __local, __constant, __private”

Work-item functions • get_work_dim() • get_global_id(), get_local_id(), get_group_id(), get_local_size()
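Putting these highlights together, a minimal illustrative kernel: element-wise vector addition using the __kernel and __global qualifiers and get_global_id().

__kernel void vec_add(__global const float *a,
                      __global const float *b,
                      __global float       *c,
                      const unsigned int    n)
{
    size_t i = get_global_id(0);   /* this work-item's position in the NDRange */
    if (i < n)                     /* guard in case the NDRange was rounded up */
        c[i] = a[i] + b[i];
}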

Image functions • Images must be accessed through built-in functions • Reads/writes are performed through sampler objects created on the host or defined in the kernel source

Synchronization functions • Barriers – All work-items within a work-group must execute the barrier function before any work-item in the work-group can continue
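An illustrative use of barrier(): each work-item stages one element in local memory, the work-group synchronises, and only then does any work-item read an element written by a neighbour.

__kernel void rotate_left(__global const float *in, __global float *out,
                          __local float *tile)
{
    size_t gid = get_global_id(0);
    size_t lid = get_local_id(0);
    size_t lsz = get_local_size(0);

    tile[lid] = in[gid];               /* each work-item loads one element       */
    barrier(CLK_LOCAL_MEM_FENCE);      /* all work-items must reach this point   */

    out[gid] = tile[(lid + 1) % lsz];  /* safe: the whole tile is now populated  */
}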