Introduction to OpenCL
Adaptation and diffusion strictly forbidden without Acsys written agreement
RTS 2013
April 11th 2013
This lecture is brought to you by
Bernard Dautrevaux
Chief Technical Officer – Ac6
Alumnus of École Normale Supérieure (ENS Cachan)
Master of Mathematics
Master of Computer Science, University of Paris, Orsay
Should you have any questions later on, feel free to contact me at mailto:[email protected]
You can also find more information on our web sites: http://www.ac6.fr or http://www.ac6-training.com
Agenda
Why OpenCL?
The Goals of OpenCL
Overview
The OpenCL Language
OpenCL Implementations
Why OpenCL?

Processor Parallelism
• CPUs: multiple cores provide increased performance
  Multiprocessor programming (for example OpenMP)
• GPUs: increasingly general-purpose data-parallel computing
  Graphics APIs and Shading Languages
• Emerging intersection: Heterogeneous Computing
OpenCL is a programming framework for heterogeneous computing resources
Original graphics design from The Khronos Group

The Origins of OpenCL
• AMD and ATI: merged, needed commonality across products
• Nvidia: GPU vendor, wants to steal market share from CPU vendors
• Intel: CPU vendor, wants to steal market share from GPU vendors
• Apple: was tired of recoding for many core and GPUs; pushed vendors to standardize
  Wrote a rough draft, straw man API
• Khronos Compute group formed
  Other companies shown: Ericsson, Nokia, IBM, Sony, EA, Freescale, TI, STM ...
• Timeline label: Dec 2008

The OpenCL Working Group
• OpenCL means Open Computing Language
• Diverse industry participation
  Processor vendors, system OEMs, middleware vendors, application developers
• Many industry-leading experts involved in OpenCL’s design
  A healthy diversity of industry perspectives
• Apple made the initial proposal in 2008
  Apple is very active in the working group, serving as specification editor
• The OpenCL standard is edited by The Khronos Group
  Founded in January 2000 by a number of leading media-centric companies
    3Dlabs, ATI, Discreet, Evans & Sutherland, Intel, NVIDIA, SGI, Sun Microsystems...
  Now more than 100 members, including STMicroelectronics
  Dedicated to creating open standard APIs to enable authoring and playback of rich media
• The Khronos Group edits several standards:
  OpenCL
  OpenGL, OpenGL SC (Safety Critical), OpenGL ES (Embedded Systems): 3D graphics
  OpenVG: 2D vector graphics
  OpenMAX AL, IL, DL: streaming media recording and playback, processing, codecs
  OpenKode, OpenWF, WebGL, Collada, EGL...
• No one company “owns” the OpenCL standard, not even Apple
  This ensures you can depend on it without fear
The OpenCL Timeline
• Six months from proposal to released OpenCL 1.0 specification
  Due to a strong initial proposal and a shared commercial incentive
• Multiple conformant implementations shipping
  Apple’s Mac OS X has shipped with OpenCL since Snow Leopard
• 18 months between OpenCL 1.0 (December 2008) and OpenCL 1.1 (June 2010)
  Backwards compatibility protects software investment
• About 18 months again before OpenCL 1.2 (released 15th November 2011)
  Adds new features, like device partitioning, separate compilation, better OpenGL integration…
• OpenCL is only a specification
  It has to be implemented by “someone”
  Apple, Intel, AMD and NVIDIA have implementations conforming to the OpenCL-1.1 specification
  Apple, AMD and Intel (beta) support OpenCL-1.2, but NVIDIA is still at 1.1
Timeline:
  Jun 08: Apple proposes OpenCL working group and contributes draft specification to Khronos
  Dec 08: Khronos publicly releases OpenCL 1.0 as royalty-free specification
  May 09: Khronos releases OpenCL 1.0 conformance tests to ensure high-quality implementations
  Fall 09: Multiple conformant implementations ship across diverse OS and platforms
  Jun 10: OpenCL 1.1 Specification released and first implementations ship
  Nov 11: Release of OpenCL 1.2
Original graphics design from The Khronos Group

OpenCL and the OpenGL Ecosystem
• Roadmap Convergence
  OpenGL 4.0 and OpenGL ES 2.0 are both streamlined, programmable pipelines. GL and ES working groups are working on convergence. WebGL is a positive pressure for portable 3D content on all platforms
• Desktop Visual Computing
  OpenGL and OpenCL have direct interoperability. OpenCL objects can be created from OpenGL Textures, Buffer Objects and Renderbuffers
• Mobile Visual Computing
  Compute, graphics and AV APIs interoperate through EGL
Original graphics design from The Khronos Group

OpenCL: From Cell Phone to Super Computer
• OpenCL Embedded Profile for Mobile and Embedded silicon
  Relaxes some data types and precision requirements
  Avoids the need for a separate « ES » specification
• Khronos APIs provide computing support for imaging & graphics
  Enabling advanced applications
  For example in Augmented Reality: a camera phone with GPS processes images to recognize buildings and landmarks and provides relevant data from the internet
• OpenCL will enable parallel computing in new markets
Goals of OpenCL

OpenCL Goals
• Provide a simple computing model
  For high performance parallel computing
  Based on the ISO C99 language
    Some restrictions
    Some extensions
  Thread management framework
  Application and thread synchronization
• Easy to use
  Needs to be lightweight and efficient
• Powerful
  Allow use of all computing resources
• Portable and predictable
  IEEE-754 compliant rounding behavior
  Minimum accuracy for math functions
• Provide guidelines for new hardware requirements
Uses of OpenCL
• Where OpenCL can be used:
  Image, video and audio processing
  Simulations and scientific calculations
  Medical imaging
  Financial models
  All data-parallel algorithms that are computationally intensive
• There are many types of parallel computing
  They can be categorized by granularity
  From coarse to fine granularity:
    Grid Computing
    MPI/OpenMPI
    OpenMP/pthreads
    SIMD
  OpenCL's goal is to cover fine-grained parallelism (threads, SIMD...)
• OpenCL was designed to work seamlessly with OpenGL
  When OpenCL works on a GPU, it can directly share data buffers with OpenGL
Task- vs Data-Parallel Computing
• Task Parallelism
  Coarse-grain distribution in a single system
  [Figure: Task1, Task1, Task1 shown side by side]
• Data-Parallelism
  Same (set of) operation(s) on multiple data independently, here abs():
  Input:   0  3 -7  5 -4 -1  3 -9
  Output:  0  3  7  5  4  1  3  9
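The abs() figure above can be mimicked in plain C. This is a minimal sketch rather than real OpenCL code: the names abs_work_item and abs_data_parallel are hypothetical, and the loop only stands in for what an OpenCL device would run concurrently, one work-item per element.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for a kernel body: handles exactly one element.
 * It reads in[gid] and writes out[gid], and touches nothing else. */
static void abs_work_item(size_t gid, const int *in, int *out)
{
    out[gid] = abs(in[gid]);
}

/* Sequential emulation of a data-parallel launch over n elements.
 * No iteration depends on another, so they could run in any order. */
void abs_data_parallel(const int *in, int *out, size_t n)
{
    for (size_t gid = 0; gid < n; ++gid)
        abs_work_item(gid, in, out);
}
```

Calling abs_data_parallel on the eight input values of the figure fills the output array with 0 3 7 5 4 1 3 9. Because no iteration reads what another one writes, the iterations could just as well run all at once.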
OpenCL covers both task-parallelism and data-parallelism

Data-parallelism
The Box Filter
• Create a new image
• Compute the average at any “box”
• Place it at the middle pixel on the new image
• Operations can be done totally independently
  Results are stored at another location
• Computation is quite intensive
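A minimal C sketch of the box filter described above, assuming an 8-bit grayscale image stored row by row and a 3 x 3 box. The names box_at and box_filter, the box size and the handling of borders (pixels outside the image are simply ignored) are assumptions made for illustration; the slide does not specify them.

```c
/* Average of the 3 x 3 box centred on (x, y).
 * Assumption: neighbours that fall outside the image are skipped,
 * so border pixels are averaged over fewer samples. */
static unsigned char box_at(const unsigned char *src, int width, int height,
                            int x, int y)
{
    int sum = 0;
    int count = 0;
    for (int dy = -1; dy <= 1; ++dy) {
        for (int dx = -1; dx <= 1; ++dx) {
            int nx = x + dx;
            int ny = y + dy;
            if (nx >= 0 && nx < width && ny >= 0 && ny < height) {
                sum += src[ny * width + nx];
                ++count;
            }
        }
    }
    return (unsigned char)(sum / count); /* count >= 1: the centre is always inside */
}

/* Writes the filtered image into dst, a separate buffer of the same size.
 * Each output pixel depends only on src, so every (x, y) is independent. */
void box_filter(const unsigned char *src, unsigned char *dst,
                int width, int height)
{
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            dst[y * width + x] = box_at(src, width, height, x, y);
}
```

Writing to a separate destination image is what keeps the per-pixel computations independent: no pixel ever reads a value that another pixel has just overwritten.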
Why put the emphasis on GPUs?
• GPUs are crazy fast floating point number crunchers
  They do almost only that
  But do it insanely fast...
• GPUs are designed for highly scalable parallelism
  They are simple ALUs
  You can multiply them
• GPU performance is increasing faster than CPU performance
  It’s simple to add more small cores
• GPU bandwidth is much larger than CPU bandwidth
  Working in-card
  Memory to ALU
Core 2 Duo
  • 150 W
  • 4 Cores
  • 8.5 GiB/s
  • 45 Gflop/s
NVIDIA GTX 285
  • 204 W
  • 240 Cores
  • 159 GiB/s
  • ~1 Tflop/s
Why can GPUs be so fast?
• The GPU is specialized for compute-intensive, highly parallel, simple computations
  +, -, *, / in fixed or single precision FP
  No error checks, consistent rounding and saturation
  This is exactly what graphics rendering is all about
• Transistors are devoted to data processing rather than data caching and flow control
  While general purpose CPUs must do the opposite
[Figure: CPU and GPU block diagrams showing ALU, Control, Cache and DRAM blocks]

GPU Limitations
• System to GPU bandwidth is limited
  3-4 GiB/s transfers = 1 byte/clock cycle
  In the time needed to copy 16 bytes, the following could have been done:
    • 64 adds
    • 64 multiplies
    • Loaded 256 bytes into registers from L1
    • Stored 256 bytes from registers into L1
    • Copied 16 vector registers to other registers
• GPUs are not very smart
  They handle errors quite badly
• GPU programs can be difficult to debug
• GPUs like (need?) their data specifically arranged
  Otherwise efficiency degrades a lot
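One common way to arrange data for a GPU is to replace an array of structures by a structure of arrays, so that neighbouring work-items read neighbouring addresses. This is given as an assumption, since the slide does not name a specific layout. The C sketch below only contrasts the two layouts; all type and function names are hypothetical.

```c
#include <stddef.h>

/* Array of structures (AoS): the x fields of consecutive points are
 * separated by the y and z fields of the point in between. */
struct point_aos {
    float x, y, z;
};

/* Structure of arrays (SoA): all x values are contiguous in memory,
 * and likewise for y and z. */
struct points_soa {
    const float *x;
    const float *y;
    const float *z;
};

/* Sums the x coordinates with strided accesses (one x every 3 floats). */
float sum_x_aos(const struct point_aos *points, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += points[i].x;
    return sum;
}

/* Sums the x coordinates with unit-stride accesses. */
float sum_x_soa(const struct points_soa *points, size_t n)
{
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i)
        sum += points->x[i];
    return sum;
}
```

Both functions return the same value; only the memory access pattern differs. Whether the second layout is actually faster depends on the device, so it should be measured rather than assumed.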
When not to use OpenCL
• OpenCL is not designed for
  Sequential problems
    If there is no inherent parallelism, OpenCL will not help
    Example: Huffman decoding
  Calculations that need
    a lot of synchronization
    a lot of pointer chasing or constant data permutations
    a lot of communication and result updates
• Device dependent limitations may also arise
• Not everything will work well with OpenCL
  OpenCL was not designed to solve all problems, just highly parallelizable problems

Overview
A Heterogeneous World
• A modern platform includes
  One or more CPUs
    Maybe multi-core
  One or more GPUs
  DSPs
  ...
• Performance increases are no longer a clock issue
  They are now obtained by multiple cores
• Memory may not be uniformly accessible (NUMA)
[Figure: platform block diagram with two CPUs, a GPU, the GMCH and the ICH]
GMCH: Graphics and Memory Control Hub
ICH: Input/output Control Hub
OpenCL is designed to let programmers write a single portable program that uses all resources of the platform
Original graphics design from The Khronos Group

The Idea Behind OpenCL
• The OpenCL execution model
  OpenCL executes a kernel at each point in a problem domain
  E.g. to process a 1024 x 1024 image
    Create one kernel invocation per pixel
    That is 1024 x 1024 = 1,048,576 kernel executions
• OpenCL treats all supported devices as peers
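The idea of one kernel invocation per point of the problem domain can be emulated on the host in plain C. This is only a sketch under stated assumptions: invert_pixel is a hypothetical per-pixel kernel body, run_over_domain is a hypothetical helper, and the two nested loops replace what an OpenCL device would schedule in parallel.

```c
#include <stddef.h>

/* Hypothetical kernel body: inverts one 8-bit pixel.
 * gx and gy play the role of get_global_id(0) and get_global_id(1). */
static void invert_pixel(size_t gx, size_t gy, size_t width,
                         const unsigned char *in, unsigned char *out)
{
    size_t i = gy * width + gx;
    out[i] = (unsigned char)(255 - in[i]);
}

/* Emulates a 2-dimensional launch over a width x height domain and
 * returns how many times the kernel body was invoked. */
size_t run_over_domain(size_t width, size_t height,
                       const unsigned char *in, unsigned char *out)
{
    size_t invocations = 0;
    for (size_t gy = 0; gy < height; ++gy) {
        for (size_t gx = 0; gx < width; ++gx) {
            invert_pixel(gx, gy, width, in, out);
            ++invocations;
        }
    }
    return invocations;
}
```

For a 1024 x 1024 image the helper reports 1,048,576 invocations, which is the number of kernel executions quoted above.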
Traditional loops:

    void mul(int n,
             const float* a,
             const float* b,
             float* c)
    {
        int i;
        /* element-wise product, as the name mul suggests */
        for (i = 0; i < n; i++)
            c[i] = a[i] * b[i];
    }

Data-parallel OpenCL:

    kernel void mul(global const float* a,
                    global const float* b,
                    global float* c)
    {
        /* one work-item per element: the loop index becomes the global id */
        int i = get_global_id(0);
        c[i] = a[i] * b[i];
    }

An N-dimensional domain of work-items
• You must define the “best” N-dimensional index space for your algorithm
  Global dimensions: 1024 x 1024 (whole problem space)
  Local dimensions: 128 x 128 (work group, executes together)
• Synchronization between work-items:
  Possible only within the same workgroup
    Use barriers and memory fences
  Cannot synchronize between work-items of two different workgroups
• Choose the dimensions that are « best » for your algorithm

The OpenCL Platform Model
• One Host is connected to one or more Compute Devices (CD), made of
  One or more Compute Units (CU), made of
    One or more Processing Elements (PE), made of
      One core
      Some (optional) private memory
    Some (optional) local memory
  Some Global and Constant Memory Data Cache
  Some Global and Constant Memory
• OpenCL is a device agnostic, general purpose, portable, parallel programming system
Original graphics from The Khronos Group

The OpenCL Context
• An OpenCL context is:
  The environment within which kernels will execute
  The domain in which synchronization and memory management are defined
• The context includes:
  A set of devices
  The memory accessible to those devices
  One or more command-queues used to schedule execution of kernels or operations on memory objects
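To make the role of a command-queue more concrete, here is a small host-side emulation in plain C. It is not the OpenCL API: in_order_queue, queue_enqueue, queue_finish and the two sample commands are hypothetical names, and commands are modelled as C function pointers executed in submission order, as an in-order queue would do.

```c
#include <stddef.h>

#define MAX_COMMANDS 8

/* Hypothetical command: any operation applied to a buffer of n floats. */
typedef void (*command_fn)(float *buffer, size_t n);

/* Hypothetical in-order queue: commands are kept in submission order. */
struct in_order_queue {
    command_fn commands[MAX_COMMANDS];
    size_t count;
};

/* Appends a command; returns 0 on success, -1 if the queue is full. */
int queue_enqueue(struct in_order_queue *q, command_fn fn)
{
    if (q->count == MAX_COMMANDS)
        return -1;
    q->commands[q->count++] = fn;
    return 0;
}

/* Runs every queued command in submission order, then empties the queue. */
void queue_finish(struct in_order_queue *q, float *buffer, size_t n)
{
    for (size_t i = 0; i < q->count; ++i)
        q->commands[i](buffer, n);
    q->count = 0;
}

/* Sample "memory command": fills the buffer with the value 3. */
void fill_with_three(float *buffer, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buffer[i] = 3.0f;
}

/* Sample "kernel execution command": squares every element in place. */
void square_in_place(float *buffer, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        buffer[i] *= buffer[i];
}
```

Enqueuing fill_with_three and then square_in_place leaves 9 in every element after queue_finish, which shows why ordering matters. In the real API the corresponding steps are performed with calls such as clEnqueueWriteBuffer(), clEnqueueNDRangeKernel() and clFinish().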
• Contexts contain and manage the state of the “world” in OpenCL
  Kernel execution commands
  Memory commands: transfer or mapping of memory object data
  Synchronization commands: constrain the order of execution of commands

The OpenCL Execution Model
• An OpenCL application runs on a host which submits work to the compute devices
  Work item: the basic unit of work on an OpenCL device
  Kernel: the code for a work item. Basically a C function
  Program: collection of kernels and other functions
    Analogous to a dynamic library
  Context: the environment within which work items execute
    Includes devices, their memories and command queues
• Applications queue kernel execution instances
  They are queued in-order
    One queue per device
  They execute in-order or out-of-order
    Depending on the queue
Original graphics from The Khronos Group

OpenCL Kernels
• The OpenCL execution model:
  Define a problem domain
  Execute an instance of a kernel for each point in the domain

    kernel void square(global float* input,
                       global float* output)
    {
        int i = get_global_id(0);
        output[i] = input[i] * input[i];
    }

  Figure: the work-item with get_global_id(0) = 9 is highlighted
  input:   6  4 1 1 0 1 0  9  4  5 1 1  7  6 3  5 0 1 0 1  8  6 0 3 2 0 1 3  9 0 1 0
  output: 36 16 1 1 0 1 0 81 16 25 1 1 49 36 9 25 0 1 0 1 64 36 0 9 4 0 1 9 81 0 1 0

The OpenCL Memory Model
• Private Memory
  Per work-item
• Local Memory
  Shared within a workgroup
• Global/Constant Memory
  Visible to all workgroups
• Host Memory
  On the main CPU
• Memory Management is Explicit
  You must move data from Host -> Global -> Local -> Private and back (Private -> Local -> Global -> Host)
Original graphics from The Khronos Group

OpenCL Scalability
[Figure: an OpenCL program split into 8 blocks (Block 0 to Block 7). On a dual-core device the blocks run two at a time (Core 0: blocks 0, 2, 4, 6; Core 1: blocks 1, 3, 5, 7). On a quad-core device they run four at a time (blocks 0 to 3 on cores 0 to 3, then blocks 4 to 7), so the same program needs fewer time steps]

The OpenCL Language
• Kernels are programmed using the OpenCL Language
• It is a subset of ISO C99
  Some features are omitted, like
    Standard C99 headers
    Function pointers
    Recursion
    Variable length arrays
    Bit fields
• It is also a superset of ISO C99, with additions for
  Workitems and workgroups definition
  Vector types
  Synchronization
  Address space qualifiers
• Also includes a large set of built-in functions
  Image manipulation
  Workitem manipulation
  Specialized math routines
  ...
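Because recursion is among the omitted C99 features, helper functions called from a kernel have to be written iteratively. As an illustration only, here is a recursive greatest common divisor next to an iterative version. The function names are hypothetical and the example is plain C, so that it can be compiled on the host; only the iterative form is written in a style that OpenCL C accepts.

```c
/* Recursive form: valid ISO C99, but recursion is not available
 * in the OpenCL language. */
unsigned gcd_recursive(unsigned a, unsigned b)
{
    return b == 0 ? a : gcd_recursive(b, a % b);
}

/* Iterative form of the same algorithm (Euclid): a loop replaces
 * the recursive call, so no call stack growth is needed. */
unsigned gcd_iterative(unsigned a, unsigned b)
{
    while (b != 0) {
        unsigned remainder = a % b;
        a = b;
        b = remainder;
    }
    return a;
}
```

Both functions compute the same result, for example 6 for the pair (48, 18). The rewrite is mechanical here because the recursive call is the last operation of the function.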
OpenCL Language Highlights
• Function qualifiers
  __kernel qualifier declares a function as a kernel
  Kernels can call other kernel functions
• Address space qualifiers
  __global, __local, __constant, __private
  Pointer kernel arguments must be declared with an address space qualifier
• Work-item functions
  Query work-item identifiers
    get_work_dim(), get_global_id(), get_local_id(), get_group_id()
• Synchronization functions
  Barriers: all work-items within a work-group must execute the barrier function before any work-item can continue
  Memory fences provide ordering between memory operations

Building Program Objects
• The program object encapsulates
  A context
  The program source or binary
  A list of target devices and build options
• The build process to create a program object
  clCreateProgramWithSource()
  clCreateProgramWithBinary()

    /* Sampler assumed here; the slide does not show its declaration */
    const sampler_t sampler = CLK_NORMALIZED_COORDS_FALSE |
                              CLK_ADDRESS_CLAMP_TO_EDGE |
                              CLK_FILTER_NEAREST;

    kernel void horizontal_reflect(read_only image2d_t src,
                                   write_only image2d_t dst)
    {
        int x = get_global_id(0); // x-coord
        int y = get_global_id(1); // y-coord
        int width = get_image_width(src);
        // read the pixel mirrored around the vertical axis
        float4 src_val = read_imagef(src, sampler, (int2)(width-1-x, y));
        write_imagef(dst, (int2)(x, y), src_val);
    }

  The same source is compiled for the CPU into CPU code and for the GPU into GPU code

OpenCL Summary
Original graphics from The Khronos Group

The OpenCL Embedded Profile
• Compared to the full specification, there are a number of notable differences
• Embedded profile devices do not support any 64 bit data.
  This means that there are no double, ulong, and some vector data types
• Embedded profile devices do not have to support 3D operations.
• The IEEE 754 rounding requirement is not implemented strictly
  Full IEEE 754 requires rounding to the nearest even number
  The embedded profile supports only the basic mode of IEEE 754: all rounding can be done toward zero

OpenCL Implementations
• The OpenCL specification is open
  However, implementations are usually proprietary
• There are 4 main implementations available now
  Apple: for Mac OS X Snow Leopard
  NVIDIA: for CUDA-based GPGPUs
  AMD (ATI): for Radeon and newer boards
  Intel: for recent Core2 chips (45nm node and below)
• Mac OS X Snow Leopard
  OpenCL is a System Framework of Mac OS X since version 10.6
    It is available on all Mac OS X versions since then
  It manages the multicore CPU as well as the Mac’s GPUs
  The SDK is available in the Xcode environment
    You thus do not have to install anything new

OpenCL Implementations (2/4)
• NVIDIA
  NVIDIA supports OpenCL-1.1 on all its CUDA-compatible graphics boards
    Starting from GeForce desktop products
  Support is included in all recent drivers
  An OpenCL SDK is available for registered developers
    Registration is free
  NVIDIA OpenCL support information is available at http://developer.nvidia.com/opencl
• Installation
  You must have Microsoft Visual Studio installed
    Either Visual Studio 2005 or 2008
  First check your graphics driver
    First update to the latest official driver at http://www.nvidia.com/Download/index.aspx
    If it is more recent than v258.19, it is the one to use
    Otherwise, you will have to install a beta driver (see below)
  Then you must download the latest OpenCL SDK
    For this you must register as a NVIDIA/CUDA developer at http://developer.nvidia.com/join-nvidia-registered-developer-program
    Then you can download the SDK from https://nvdeveloper.nvidia.com
    Go to the OpenCL 1.1 Conformance Candidate page and download
      The dev_driver_3.1_xxxx for your platform and install it if needed
        You must have a valid graphics driver to continue
      The gpucomputingsdk_1_1_xxxx for your platform
    You can then install it
  The GPU Computing SDK comes with source code and prebuilt samples
    Although OpenCL sample code is no longer bundled in recent SDKs (since fall 2012…)
    References to OpenCL seem to be vanishing from the NVIDIA web site…

OpenCL Implementations (3/4)
• AMD (ATI)
  AMD supports OpenCL-1.2 on most ATI graphics boards (Radeon, FirePro...)
  Support is included in recent Catalyst drivers
  An OpenCL SDK is included in the AMD Accelerated Parallel Processing (APP) SDK
    The AMD APP SDK replaced the ATI Stream SDK on August 8 2011
  AMD APP SDK can be found at http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/
• Installation
  You need a C compiler; any of
    Intel C Compiler 11.x or more recent
    Microsoft Visual Studio Professional 2008 or 2010
    Minimalist GNU for Windows (MinGW) with gcc-4.4
  A recent AMD Catalyst Driver
    You can download it from http://support.amd.com/us/gpudownload
  The AMD Accelerated Parallel Processing (APP) SDK, from http://developer.amd.com/tools/heterogeneous-computing/amd-accelerated-parallel-processing-app-sdk/downloads/
  You can then install it
  The AMD APP SDK comes with samples

OpenCL Implementations (4/4)
• Intel
  Intel provides an OpenCL SDK for programming multicore CPUs
    Unlike the Apple, NVIDIA and ATI SDKs, it can work without a GPU
    It needs at least a recent (45nm) Core 2 Duo processor
    OpenCL-1.2 support is only provided for Core i3 and more recent CPUs
  GPUs are only supported on Windows
  On Windows you will need Windows 7 and Visual Studio 2008 (2010 for OpenCL-1.2)
• Installation
  The OpenCL SDK can be obtained at http://software.intel.com/en-us/vcsource/tools/opencl-sdk
    You may have to provide a valid email address to accept the license
  The OpenCL-1.1 SDK is complete and self sufficient
    It includes samples
  The OpenCL-1.2 SDK (at opencl-sdk-2013) is still in beta
    It needs at least a Core i3 processor
  However, an OpenCL application cannot execute directly
    If you deploy it on another machine you must provide a run-time
    It’s the equivalent of the GPU drivers you need when running on a GPU
    It is named the OpenCL Installable Client Driver (ICD)
  The OpenCL runtime can be downloaded at the same address
    It only exists for Windows
    For Linux you should distribute the (relevant part of) the SDK

Summary
• In this chapter you have discovered
  Why we need a new language
  The Goals of OpenCL
  The OpenCL Concepts
  The OpenCL Implementations

Questions?
Should you have any further question, feel free to contact us:
mailto:[email protected]
http://www.ac6-training.com
http://www.ac6.fr