Introduction to OpenCL

Adaptation and diffusion strictly forbidden without Acsys written agreement

RTS 2013

April 11th 2013

This lecture is brought to you by

Bernard Dautrevaux
Chief Technical Officer, Ac6
Alumnus of École Normale Supérieure (ENS Cachan)
Master of Mathematics
Master of Computer Science, University of Paris, Orsay

Should you have any questions later on, feel free to contact me at mailto:[email protected]

You can also find more information on our web sites: http://www.ac6.fr or http://www.ac6-training.com

Agenda

• Why OpenCL?
• The Goals of OpenCL
• Overview
• The OpenCL Language
• OpenCL Implementations

Why OpenCL?

Processor Parallelism

[Figure: CPUs and GPUs converging]
• CPUs: multiple cores provide increased performance
• GPUs: increasingly general purpose data-parallel computing
• Emerging intersection: heterogeneous computing
• Multiprocessor programming (for example OpenMP)
• Graphics and shading languages

OpenCL is a programming framework for heterogeneous computing resources

Original graphics design from The Khronos Group

The Origins of OpenCL

[Figure: the origins of OpenCL]
• AMD and ATI: merged, needed commonality across products
• Nvidia: GPU vendor, wants to steal market share from CPU vendors
• Intel: CPU vendor, wants to steal market share from GPU vendors
• Apple: was tired of recoding for many core and GPUs; pushed vendors to standardize
• A rough draft, straw man API was written
• Khronos Compute group formed, with Ericsson, Nokia, IBM, Sony, EA, Freescale, TI, STM ...
• Date shown on the figure: Dec 2008

The OpenCL Working Group

• OpenCL means Open Computing Language
• Diverse industry participation
  • Processor vendors, system OEMs, middleware vendors, application developers
• Many industry-leading experts involved in OpenCL’s design
  • A healthy diversity of industry perspectives
• Apple made the initial proposal in 2008
• Apple is very active in the working group
  • Serving as specification editor
• The OpenCL standard is edited by The Khronos Group
  • Founded in January 2000 by a number of leading media-centric companies
    • 3Dlabs, ATI, Discreet, Evans & Sutherland, Intel, NVIDIA, SGI, Sun Microsystems...
  • Now more than 100 members, including STMicroelectronics
  • Dedicated to creating open standard APIs to enable authoring and playback of rich media
• The Khronos Group edits several standards:
  • OpenCL
  • OpenGL, OpenGL SC (Safety Critical), OpenGL ES (Embedded Systems): 3D graphics
  • OpenVG: 2D vector graphics
  • OpenMAX AL, IL, DL: streaming media recording and playback, processing, codecs
  • OpenKODE, OpenWF, WebGL, COLLADA, EGL...
• No one company “owns” the OpenCL standard, not even Apple
  • This ensures you can depend on it without fear

The OpenCL Timeline

• Six months from proposal to released OpenCL 1.0 specification
  • Due to a strong initial proposal and a shared commercial incentive
• Multiple conformant implementations shipping
  • Apple’s Mac OS X has shipped with OpenCL since Snow Leopard
• 18 months between OpenCL 1.0 (December 2008) and OpenCL 1.1 (June 2010)
  • Backwards compatibility protects software investment
• 18 months again before OpenCL 1.2 (released 15th November 2011)
  • Adds new features, like device partitioning, separate compilation, better OpenGL integration…
• OpenCL is only a specification
  • It has to be implemented by “someone”
  • Apple, Intel, AMD and NVIDIA have implementations conforming to the OpenCL 1.1 specification
  • Apple, AMD and Intel (beta) support OpenCL 1.2, but NVIDIA is still at 1.1

Timeline:
• Jun 08: Apple proposes OpenCL working group and contributes draft specification to Khronos
• Dec 08: Khronos publicly releases OpenCL 1.0 as royalty-free specification
• May 09: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations
• Fall 09: Multiple conformant implementations ship across diverse OS and platforms
• Jun 10: OpenCL 1.1 Specification released and first implementations ship
• Nov 11: Release of OpenCL 1.2

Original graphics design from The Khronos Group

OpenCL and the OpenGL Ecosystem

[Figure: the OpenGL ecosystem]
• Desktop Visual Computing: OpenGL and OpenCL have direct interoperability. OpenCL objects can be created from OpenGL Textures, Buffer Objects and Renderbuffers.
• Roadmap Convergence: OpenGL 4.0 and OpenGL ES 2.0 are both streamlined, programmable pipelines. GL and ES working groups are working on convergence. WebGL is a positive pressure for portable 3D content on all platforms.
• Mobile Visual Computing: Compute, graphics and AV APIs interoperate through EGL.

Original graphics design from The Khronos Group

OpenCL: From Cell Phone to Super Computer

• OpenCL Embedded Profile for Mobile and Embedded silicon
  • Relaxes some data types and precision requirements
  • Avoids the need for a separate “ES” specification
• Khronos APIs provide computing support for imaging & graphics
  • Enabling advanced applications
  • For example in Augmented Reality
• OpenCL will enable parallel computing in new markets

[Figure: a camera phone with GPS processes images to recognize buildings and landmarks and provides relevant data from the internet]

Goals of OpenCL

OpenCL Goals

• Provide a simple computing model
  • For high performance parallel computing
• Based on the ISO C99 language
  • Some restrictions
  • Some extensions
• Thread management framework
  • Application and thread synchronization
  • Easy to use
  • Needs to be lightweight and efficient
• Powerful
  • Allow use of all computing resources
• Portable and predictable
  • IEEE-754 compliant rounding behavior
  • Minimum accuracy for math functions
• Provide guidelines for new hardware requirements

Uses of OpenCL

• Where OpenCL can be used:
  • Image, video and audio processing
  • Simulations and scientific calculations
  • Medical imaging
  • Financial models
  • All data-parallel algorithms that are computationally intensive
• There are many types of parallel computing
  • They can be categorized by granularity
  • From coarse to fine granularity:
    • Grid Computing
    • MPI/OpenMPI
    • OpenMP/pthreads
    • SIMD
  • OpenCL’s goal is to cover fine-grained parallelism (threads, SIMD...)
• OpenCL was designed to work seamlessly with OpenGL
  • When OpenCL works on a GPU, it can directly share data buffers with OpenGL

Task- vs Data-Parallel Computing

• Task parallelism
  • Coarse-grain distribution in a single system
  [Figure: tasks running side by side]
• Data parallelism
  • Same (set of) operation(s) on multiple data independently: abs()
  input:  0 3 -7 5 -4 -1 3 -9
  output: 0 3 7 5 4 1 3 9

OpenCL covers both task-parallelism and data-parallelism

Data-parallelism

• The Box Filter
  • Create a new image
  • Compute the average over each “box” of pixels
  • Place it at the middle pixel of the box in the new image
• Operations can be done totally independently
  • Results are stored at another location
• Computation is quite intensive

Why put the emphasis on GPUs?

• GPUs are crazy fast floating point number crunchers
  • They do almost only that
  • But do it insanely fast...
• GPUs are designed for highly scalable parallelism
  • They are simple ALUs
  • You can multiply them
• GPU performance is increasing faster than CPU performance
  • It’s simple to add more small cores
• GPU bandwidth is much larger than CPU bandwidth
  • Working in-card
  • Memory to ALU

               Core 2 Duo    NVIDIA GTX 285
Power          150 W         204 W
Cores          4             240
Bandwidth      8.5 GiB/s     159 GiB/s
Performance    45 Gflop/s    ~1 Tflop/s

Why can GPUs be so fast?

• The GPU is specialized for
  • Compute-intensive, highly parallel, simple computations
    • +, -, *, / in fixed or single precision FP
    • no error checks, consistent rounding and saturation
  • This is exactly what graphics rendering is all about
• Transistors are devoted to data processing rather than data caching and flow control
  • While general purpose CPUs must do the opposite

[Figure: CPU and GPU block diagrams, showing ALU, Control, Cache and DRAM blocks]

GPU Limitations

• System to GPU bandwidth is limited
  • 3-4 GiB/s transfers = 1 byte/clock cycle
• GPUs are not very smart
  • They handle errors quite badly
• GPU programs can be difficult to debug
• GPUs like (need?) their data specifically arranged
  • Otherwise efficiency degrades a lot

During the time needed to copy 16 bytes:
• 64 adds
• 64 multiplies
• Loaded 256 bytes into registers from L1
• Stored 256 bytes from registers into L1
• Copied 16 vector registers to other registers

When not to use OpenCL

OpenCL is not designed for

• Sequential problems
  • If there is no inherent parallelism, OpenCL will not help
  • Example: Huffman decoding
• Calculations that need
  • a lot of synchronization
  • a lot of pointer chasing or constant data permutations
  • a lot of communication and result updates

• Device dependent limitations may also arise

Not everything will work well with OpenCL.
OpenCL was not designed to solve all problems, just highly parallelizable problems.

Overview

A Heterogeneous World

• A modern platform includes
  • One or more CPUs
    • Maybe multi-core
  • One or more GPUs
  • DSPs
  • ...
• Performance increases are no longer a clock issue
  • They are now obtained by multiple cores
• Memory may not be uniformly accessible (NUMA)

[Figure: two CPUs and a GPU connected through the GMCH and the ICH]
GMCH: Graphics and Memory Controller Hub
ICH: Input/output Controller Hub

OpenCL is designed to let programmers write a single portable program that uses all resources of the platform

Original graphics design from The Khronos Group

The Idea Behind OpenCL

• The OpenCL execution model
  • OpenCL executes a kernel at each point in a problem domain
  • E.g. to process a 1024 x 1024 image
    • Create one kernel invocation per pixel
    • That is 1024 x 1024 = 1,048,576 kernel executions
• OpenCL treats all supported devices as peers

Traditional loops:

    void mul(int n,
             const float* a,
             const float* b,
             float* c)
    {
        int i;
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }

Data-parallel OpenCL:

    kernel void mul(global const float* a,
                    global const float* b,
                    global float* c)
    {
        int i = get_global_id(0);
        // one work-item per element: no loop needed
        c[i] = a[i] * b[i];
    }

An N-dimensional domain of work-items

• You must define the “best” N-dimensional index space for your algorithm
  • Global dimensions: 1024 x 1024 (whole problem space)
  • Local dimensions: 128 x 128 (work group, executes together)

[Figure: a 1024 x 1024 index space divided into work-groups]

Synchronization between work-items:
• Possible only within the same workgroup
  • Use barriers and memory fences
• Cannot synchronize between work-items of two different workgroups

• Choose the dimensions that are “best” for your algorithm

The OpenCL Platform Model

• One Host is connected to one or more Compute Devices (CD), made of
  • One or more Compute Units (CU), made of
    • One or more Processing Elements (PE), made of
      • One core
      • Some (optional) private memory
    • Some (optional) local memory
  • Some global and constant memory data cache
  • Some global and constant memory
• OpenCL is a device agnostic, general purpose, portable, parallel programming system

Original graphics from The Khronos Group

The OpenCL Context

• An OpenCL context is:
  • The environment within which kernels will execute
  • The domain in which synchronization and memory management are defined
• The context includes:
  • A set of devices
  • The memory accessible to those devices
  • One or more command-queues used to schedule execution of kernels or operations on memory objects
    • Kernel execution commands
    • Memory commands
      • transfer or mapping of memory object data
    • Synchronization commands
      • constrain the order of execution of commands
• Contexts contain and manage the state of the “world” in OpenCL

The OpenCL Execution Model

• An OpenCL application runs on a host which submits work to the compute devices
  • Work item: the basic unit of work on an OpenCL device
  • Kernel: the code for a work item
    • Basically a C function
  • Program: collection of kernels and other functions
    • Analogous to a dynamic library
  • Context: the environment within which work items execute …
    • Includes devices, their memories and command queues
• Applications queue kernel execution instances
  • They are queued in-order …
    • One queue per device
  • They execute in-order or out-of-order
    • Depending on the queue

Original graphics from The Khronos Group

OpenCL Kernels

• The OpenCL execution model:
  • Define a problem domain
  • Execute an instance of a kernel for each point in the domain

    kernel void square(global float* input,
                       global float* output)
    {
        int i = get_global_id(0);
        output[i] = input[i] * input[i];
    }

[Figure: each work-item, identified by get_global_id(0), squares one element of input into output]

input:
6 4 1 1 0 1 0 9 4 5 1 1 7 6 3 5 0 1 0 1 8 6 0 3 2 0 1 3 9 0 1 0

output:
0 16 1 1 0 1 0 81 16 25 1 1 49 36 9 25 0 1 0 1 64 36 0 9 4 0 1 9 81 0 1 0

The OpenCL Memory Model

• Private Memory
  • Per work-item
• Local Memory
  • Shared within a workgroup
• Global/Constant Memory
  • Visible to all workgroups
• Host Memory
  • On the main CPU

Memory management is explicit
You must move data from Host -> Global -> Local -> Private and back Private -> Local -> Global -> Host

Original graphics from The Khronos Group

OpenCL Scalability

[Figure: an OpenCL program made of 8 blocks (Block 0 to Block 7), scheduled over time]
• Dual core: Core 0 runs Blocks 0, 2, 4, 6; Core 1 runs Blocks 1, 3, 5, 7
• Quad core: Core 0 runs Blocks 0, 4; Core 1 runs Blocks 1, 5; Core 2 runs Blocks 2, 6; Core 3 runs Blocks 3, 7


The OpenCL Language

• Kernels are programmed using the OpenCL Language
• It is a subset of ISO C99
  • Some features are omitted, like
    • Standard C99 headers
    • Function pointers
    • Recursion
    • Variable length arrays
    • Bit fields
• It is also a superset of ISO C99, with additions for
  • Workitems and workgroups definition
  • Vector types
  • Synchronization
  • Address space qualifiers
• Also includes a large set of built-in functions
  • Image manipulation
  • Workitem manipulation
  • Specialized math routines
  • ...

OpenCL Language Highlights

• Function qualifiers
  • __kernel qualifier declares a function as a kernel
  • Kernels can call other kernel functions
• Address space qualifiers
  • __global, __local, __constant, __private
  • Pointer kernel arguments must be declared with an address space qualifier
• Work-item functions
  • Query work-item identifiers
    • get_work_dim(), get_global_id(), get_local_id(), get_group_id()
• Synchronization functions
  • Barriers:
    • all work-items within a work-group must execute the barrier function before any work-item can continue
  • Memory fences
    • provide ordering between memory operations

Building Program Objects

• The program object encapsulates
  • A context
  • The program source or binary
  • A list of target devices and build options
• The build process to create a program object
  • clCreateProgramWithSource()
  • clCreateProgramWithBinary()

    // Assumption: "sampler" is used below but not declared on the slide;
    // this declaration is one plausible choice.
    constant sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                                 CLK_ADDRESS_CLAMP_TO_EDGE |
                                 CLK_FILTER_NEAREST;

    kernel void
    horizontal_reflect(read_only image2d_t src,
                       write_only image2d_t dst)
    {
        int x = get_global_id(0); // x-coord
        int y = get_global_id(1); // y-coord
        int width = get_image_width(src);
        float4 src_val = read_imagef(src, sampler,
                                     (int2)(width-1-x, y));
        write_imagef(dst, (int2)(x, y), src_val);
    }

[Figure: the same kernel source is compiled for the CPU into CPU code and for the GPU into GPU code]

OpenCL Summary

Original graphics from The Khronos Group

The OpenCL Embedded Profile

• Compared to the full specification, there are a number of notable differences
• Embedded profile devices do not have to support 64-bit data
  • This means that double, ulong and some vector data types may be missing
• Embedded profile devices do not have to support 3D operations
• The IEEE 754 rounding requirement is not implemented strictly
  • Full IEEE 754 requires rounding to the nearest value, with ties going to even
  • Embedded profile supports only the basic mode of IEEE 754
    • all rounding can be done toward zero

OpenCL Implementations

• The OpenCL specification is open
  • However implementations are usually proprietary
• There are 4 main implementations available now
  • Apple
    • For Mac OS X Snow Leopard
  • NVIDIA
    • For CUDA-based GPGPUs
  • AMD (ATI)
    • For Radeon and newer boards
  • Intel
    • For recent Core2 chips (45nm node and below)
• Mac OS X Snow Leopard
  • OpenCL is a System Framework of Mac OS X since version 10.6
    • It is available on all Mac OS X versions since then
  • It manages the multicore CPU as well as the Mac’s GPUs
  • The SDK is available in the Xcode environment
    • You thus do not have to install anything new

OpenCL Implementations (2/4)

• NVIDIA
  • NVIDIA supports OpenCL 1.1 on all its CUDA-compatible graphics boards
    • Starting from GeForce desktop products
  • Support is included in all recent drivers
  • An OpenCL SDK is available for registered developers
    • Registration is free
  • NVIDIA OpenCL support information is available at http://developer.nvidia.com/opencl
• Installation
  • You must have Microsoft Visual Studio installed
    • Either Visual Studio 2005 or 2008
  • First check your graphics driver
    • First update to the latest official driver at http://www.nvidia.com/Download/index.aspx
    • If it is more recent than v258.19, it is the one to use
    • Otherwise, you will have to install a beta driver (see below)
  • Then you must download the latest OpenCL SDK
    • For this you must register as a NVIDIA/CUDA developer at http://developer.nvidia.com/join-nvidia-registered-developer-program
    • Then you can download the SDK from https://nvdeveloper.nvidia.com
    • Go to the OpenCL 1.1 Conformance Candidate page and download
      • The dev_driver_3.1_xxxx for your platform and install it if needed
        • You must have a valid graphics driver to continue
      • The gpucomputingsdk_1_1_xxxx for your platform
    • You can then install it
  • The GPU Computing SDK comes with source code and prebuilt samples
    • Although OpenCL sample code is no longer bundled in recent SDKs (since fall 2012…)
    • References to OpenCL seem to be vanishing from the NVIDIA web site…

OpenCL Implementations (3/4)

• AMD (ATI)
  • AMD supports OpenCL 1.2 on most ATI graphics boards (Radeon, FirePro...)
  • Support is included in recent Catalyst drivers
  • An OpenCL SDK is included in the AMD Accelerated Parallel Processing (APP) SDK
    • The AMD APP SDK replaced the ATI Stream SDK on August 8 2011
  • The AMD APP SDK can be found at http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
• Installation
  • You need a C compiler; any of
    • Intel C Compiler 11.x or more recent
    • Microsoft Visual Studio Professional 2008 or 2010
    • Minimalist GNU for Windows (MinGW) with gcc-4.4
  • A recent AMD Catalyst driver
    • You can download it from http://support.amd.com/us/gpudownload
  • The AMD APP SDK, from http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/
    • You can then install it
  • The AMD APP SDK comes with samples

OpenCL Implementations (4/4)

• Intel
  • Intel provides an OpenCL SDK for programming multicore CPUs
    • Unlike the Apple, NVIDIA and ATI SDKs, it can work without a GPU
    • It needs at least a recent (45nm) Core 2 Duo processor
    • OpenCL 1.2 support is only provided for Core i3 and more recent CPUs
  • GPUs are only supported on Windows
  • On Windows you will need Windows 7 and Visual Studio 2008 (2010 for OpenCL 1.2)
• Installation
  • The OpenCL SDK can be obtained at http://software.intel.com/en-us/vcsource/tools/opencl-sdk
    • You may have to provide a valid email address to accept the license
  • The OpenCL 1.1 SDK is complete and self sufficient
    • It includes samples
  • The OpenCL 1.2 SDK (at opencl-sdk-2013) is still in beta
    • It needs at least a Core i3 processor
  • However, an OpenCL application cannot execute directly
    • If you deploy it on another machine you must provide a run-time
    • It’s the equivalent of the GPU drivers you need when running on a GPU
    • It is named the OpenCL Installable Client Driver (ICD)
  • The OpenCL runtime can be downloaded at the same address
    • It only exists for Windows
    • For Linux you should distribute (the relevant part of) the SDK

Summary

In this chapter you have discovered

• Why we need a new language
• The Goals of OpenCL
• The OpenCL Concepts
• The OpenCL Implementations

Questions?

Should you have any further questions, feel free to contact us:
mailto:[email protected]
http://www.ac6-training.com
http://www.ac6.fr