Cross-Platform GPGPU

Organic Vectory B.V.: Project Services, Consultancy Services

Expertise: 3D Visualization, GPU Computing, Architecture/Design. Markets: Embedded Software, GIS, Finance.

George van Venrooij

Organic Vectory B.V. [email protected]

OpenCL

● As defined by the Khronos Group (www.khronos.org):
  ● "OpenCL™ is the first open, royalty-free standard for cross-platform, parallel programming of modern processors found in personal computers, servers and handheld/embedded devices. OpenCL (Open Computing Language) greatly improves speed and responsiveness for a wide spectrum of applications in numerous market categories from gaming and entertainment to scientific and medical software."
● The Khronos Group controls many other standards, such as:
  ● OpenGL (ES)
  ● OpenVG
  ● COLLADA
  ● WebGL
  ● … and many more

OpenCL vs. CUDA

OpenCL & CUDA Device Memory Model

[Figure: side-by-side comparison of the OpenCL and CUDA device memory models (source: nVidia OpenCL Tutorial Slides)]

Terminology

  OpenCL            CUDA
  Work-Item         Thread
  Work-Group        Thread-Block
  Global Memory     Global Memory
  Constant Memory   Constant Memory
  Local Memory      Shared Memory
  Private Memory    Local Memory

Qualifiers

  OpenCL                  CUDA
  __kernel                __global__
  (no qualifier needed)   __device__ (function)
  __constant              __constant__
  __global                __device__ (variable)
  __local                 __shared__
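To make the qualifier mapping concrete, here is a minimal OpenCL C sketch; the kernel name, arguments and helper function are illustrative (not from the talk), with the CUDA equivalents noted in comments:

    /* __constant (OpenCL) ~ __constant__ (CUDA): read-only constant memory */
    __constant float scale = 2.0f;

    /* Plain device function: no qualifier needed in OpenCL,
       CUDA would require __device__ */
    float twice(float x) { return scale * x; }

    /* __kernel ~ __global__: entry point callable from the host */
    __kernel void scale_array(__global const float *in,   /* __global ~ pointer to device global memory */
                              __global float *out,
                              __local float *scratch)     /* __local ~ __shared__: per-work-group memory */
    {
        size_t lid = get_local_id(0);
        size_t gid = get_global_id(0);

        scratch[lid] = twice(in[gid]);     /* stage the value in local (shared) memory */
        barrier(CLK_LOCAL_MEM_FENCE);      /* ~ __syncthreads() */

        out[gid] = scratch[lid];
    }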

Indexing

  OpenCL              CUDA
  get_num_groups()    gridDim
  get_local_size()    blockDim
  get_group_id()      blockIdx
  get_local_id()      threadIdx
  get_global_id()     (calculate manually)
  get_global_size()   (calculate manually)

API Objects

  OpenCL             CUDA
  cl_device_id       CUdevice
  cl_context         CUcontext
  cl_program         CUmodule
  cl_kernel          CUfunction
  cl_mem             CUdeviceptr
  cl_command_queue   CUstream
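A small OpenCL C sketch of the indexing mapping (kernel name and arguments are illustrative): OpenCL provides the global index directly, while the same value can be rebuilt from the group/local functions, which is what CUDA code computes from blockIdx, blockDim and threadIdx.

    __kernel void add_one(__global float *data, const unsigned int n)
    {
        /* OpenCL built-in global index */
        size_t gid = get_global_id(0);

        /* Equivalent, built from the "CUDA-style" pieces (assuming no global
           work offset); in CUDA: blockIdx.x * blockDim.x + threadIdx.x */
        size_t gid_manual = get_group_id(0) * get_local_size(0) + get_local_id(0);

        if (gid < n && gid == gid_manual)
            data[gid] += 1.0f;
    }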

Kernel Language

● OpenCL:
  ● Subset of C99 with data-parallel extensions
  ● Requires run-time compilation by the OpenCL driver
  ● No function pointers or recursion
● CUDA:
  ● "C for CUDA", a subset of C with data-parallel extensions
  ● C++ features (function templates) enable higher productivity through meta-programming techniques
  ● Compilation through a separate compiler: NVCC
  ● No function pointers or recursion

Device Thread Synchronization

  OpenCL             CUDA
  barrier()          __syncthreads()
  (no equivalent)    __threadfence()
  mem_fence()        __threadfence_block()
  read_mem_fence()   (no equivalent)
  write_mem_fence()  (no equivalent)
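As a sketch of barrier(), the OpenCL counterpart of __syncthreads(), here is an illustrative work-group reduction (kernel and buffer names are assumptions; the work-group size is assumed to be a power of two):

    __kernel void partial_sums(__global const float *in,
                               __global float *group_sums,
                               __local float *scratch)
    {
        size_t lid = get_local_id(0);
        size_t lsz = get_local_size(0);

        scratch[lid] = in[get_global_id(0)];
        barrier(CLK_LOCAL_MEM_FENCE);           /* all work-items have written scratch */

        /* tree reduction within the work-group */
        for (size_t stride = lsz / 2; stride > 0; stride /= 2) {
            if (lid < stride)
                scratch[lid] += scratch[lid + stride];
            barrier(CLK_LOCAL_MEM_FENCE);       /* ~ __syncthreads() between steps */
        }

        if (lid == 0)
            group_sums[get_group_id(0)] = scratch[0];
    }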

Performance Comparison (1): GPU Profiler Tool

[Figure: OpenCL vs CUDA benchmark results, tested on a Tesla C2050 (source: accelereyes.com)]

Performance Comparison (2) / Performance Comparison (3)

[Figures: total time in seconds for four benchmarks: a particle simulation (x-axis: particle count), a PI approximation (x-axis: iterations), random global memory reads, and random global memory writes (x-axis: count). Each benchmark is plotted twice: OpenCL vs CUDA on an NVidia GeForce GTX 285, and ATI Radeon HD 5870 vs NVidia GeForce GTX 285.]

Preliminary Conclusions

● There are cases where OpenCL performs better than CUDA
● There are cases where CUDA performs better than OpenCL
● OpenCL seems to have slightly higher overhead for kernel launches compared to CUDA on NVidia's platform
● For some cases the differences can be large, but...
● Measuring = knowing!

Back to the Host

Host Synchronization: CUDA Streams

● Streams are a sequence of commands that execute in order
● Streams can contain kernel launches and/or memory transfers
● Host code can wait for stream completion using the cudaStreamSynchronize() call
● Events can be inserted into the stream
● Host code can query event completion or perform a blocking wait for an event
● Useful for synchronization with host code and timing

Host Synchronization: OpenCL Command Queues

● The default behavior of a command queue is similar to a CUDA stream
● One big difference: out-of-order execution mode
● clEnqueue...() commands can be given a set of events to wait for
● Each command can itself generate an event
● Based on the dependencies between commands in the queue, OpenCL can determine which commands are allowed to execute simultaneously
● It is possible to create multiple queues for a device
● It is possible to have commands in one queue wait for events from a different queue (see the host-code sketch below)
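A host-side sketch of this pattern in C, using only standard OpenCL 1.x calls: each clEnqueue...() command takes an event wait list and can emit its own event. The function and variable names (run_once, ctx, dev, kernel, buf, host_in, host_out) are assumptions; the context, device, kernel and buffer are assumed to be created elsewhere, and error handling is omitted.

    #include <CL/cl.h>

    void run_once(cl_context ctx, cl_device_id dev, cl_kernel kernel,
                  cl_mem buf, const float *host_in, float *host_out,
                  size_t n, size_t global_size)
    {
        /* In-order queue: behaves much like a CUDA stream. Passing
           CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE instead would let the runtime
           reorder commands, constrained only by the event dependencies. */
        cl_command_queue queue = clCreateCommandQueue(ctx, dev, 0, NULL);

        cl_event write_done, kernel_done;

        /* Non-blocking write; emits an event */
        clEnqueueWriteBuffer(queue, buf, CL_FALSE, 0, n * sizeof(float),
                             host_in, 0, NULL, &write_done);

        /* Kernel launch waits for the write to complete */
        clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf);
        clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                               1, &write_done, &kernel_done);

        /* Blocking read waits for the kernel; roughly the role played by
           cudaStreamSynchronize() on the CUDA side */
        clEnqueueReadBuffer(queue, buf, CL_TRUE, 0, n * sizeof(float),
                            host_out, 1, &kernel_done, NULL);

        clReleaseEvent(write_done);
        clReleaseEvent(kernel_done);
        clReleaseCommandQueue(queue);
    }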

Task Graphs

● The commands, and the events they must wait for, create a "task graph"
● OpenCL will execute the commands in the queue as it sees fit, respecting the dependencies specified
● The end result is a task-parallel framework supporting data-parallel tasks
● Your application could be written entirely in OpenCL kernels, requiring only a small framework that fills the command queue

Intermediate Conclusions

● The programming methodology for data-parallel applications is virtually identical, i.e. if you can program in one language/environment, you can program in the other
● CUDA currently offers certain productivity advantages at the kernel level
● NVidia's hardware seems to be more capable on the GPGPU side compared to ATi's hardware
● OpenCL has the platform advantage in that it presents a unified API for ALL computing hardware in your machine
● OpenCL programs can be run on hardware from different vendors

OpenCL Implementations

AVAILABLE:

  Vendor    Type         Hardware
  Apple     CPU          x86_64
  nVidia    GPU          GeForce 8/9 series and higher
  ATi       GPU          R700/800 series
  AMD       CPU          any x86/x86_64 with SSE3 extensions
  Samsung   CPU          ARM A9
  IBM       Accelerator  Cell BE
  ZiiLabs   CPU          ARM

ANNOUNCED/UPCOMING:

  Vendor                    Type  Hardware
  Imagination Technologies  GPU   PowerVR SGX Series 5
  VIA                       GPU   VN1000 chipset
  S3                        GPU   Chrome 5400E graphics processor
  Apple                     CPU   ARM A4

Portability to other platforms

● Results of a kernel are guaranteed across platforms
● Optimal performance is not
● All platforms are required to support data-parallelism, but are not required to support task-parallelism
● OpenCL can be considered a replacement for OpenMP (data-parallel)
● OpenCL can be considered a replacement for threads (task-parallel)
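Because every vendor's implementation sits behind the same API, the available hardware can be discovered at run time. A minimal host-code sketch in C (standard OpenCL 1.x calls only; the fixed array sizes are arbitrary assumptions and error handling is omitted):

    #include <stdio.h>
    #include <CL/cl.h>

    int main(void)
    {
        cl_platform_id platforms[8];
        cl_uint num_platforms = 0;
        clGetPlatformIDs(8, platforms, &num_platforms);

        for (cl_uint p = 0; p < num_platforms; ++p) {
            char pname[256];
            clGetPlatformInfo(platforms[p], CL_PLATFORM_NAME, sizeof(pname), pname, NULL);

            cl_device_id devices[16];
            cl_uint num_devices = 0;
            clGetDeviceIDs(platforms[p], CL_DEVICE_TYPE_ALL, 16, devices, &num_devices);

            for (cl_uint d = 0; d < num_devices; ++d) {
                char dname[256];
                cl_device_type type;
                clGetDeviceInfo(devices[d], CL_DEVICE_NAME, sizeof(dname), dname, NULL);
                clGetDeviceInfo(devices[d], CL_DEVICE_TYPE, sizeof(type), &type, NULL);
                printf("%s: %s (%s)\n", pname, dname,
                       (type & CL_DEVICE_TYPE_GPU) ? "GPU" :
                       (type & CL_DEVICE_TYPE_CPU) ? "CPU" : "other");
            }
        }
        return 0;
    }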

Libraries & Tools for OpenCL

● ATi Stream Profiler (ATi hardware only)
● NVidia Visual Profiler (NVidia hardware only)
● Stream KernelAnalyzer (ATi hardware only)
● NVidia NSight (NVidia hardware only)
● gDEBugger CL (Windows, Mac, Linux, currently in beta)
● libstdcl (wrapper around context/queue management functions)
● GATLAS (matrix multiplication)
● ViennaCL (BLAS levels 1 and 2)
● Language bindings for Python, .NET, MATLAB, Fortran, Perl, Ruby and Lua

Libraries & Tools for CUDA

● cuBLAS (closed-source)
● cuFFT (closed-source)
● CUDPP (data-parallel primitives)
● Thrust (high-level CUDA & OpenMP-based algorithms)
● CULATools (LAPACK)
● NSight debugger
● NVidia Visual Profiler
● Language bindings for C++, Fortran, Java, Matlab, .NET, Python and Scala are available

(Unofficial) Sneak Preview

Things to consider

● Platforms
  ● OpenCL is currently the only choice if you do not want to tie your application to NVidia's hardware
● API stability/agility
  ● OpenCL changes more slowly and retains backward compatibility
  ● CUDA changes more rapidly and unlocks new hardware features more quickly
● Third-party availability
  ● OpenCL is about two years younger, so fewer and less mature libraries are available
  ● CUDA has spawned a host of initiatives, and various libraries are available, especially in the scientific computing domain
● Supporting tools
  ● OpenCL has a fairly young but decent set of tools
  ● NVidia recently launched the NSight debugger, which seems more mature

Questions

Further Reading

● GPGPU
  ● www.gpgpu.org
● OpenCL General
  ● www.khronos.org/opencl
● OpenCL Implementations
  ● http://developer.amd.com/documentation/articles/pages/OpenCL-and-the-ATI-Stream-v2.0-Beta.aspx
  ● http://developer.nvidia.com/object/opencl.html
  ● http://www.alphaworks.ibm.com/tech/opencl
  ● http://developer.apple.com/mac/library/documentation/Performance/Conceptual/OpenCL_MacProgGuide/WhatisOpenCL/WhatisOpenCL.html
● OpenCL/CUDA Comparisons
  ● http://www.gpucomputing.net/?q=node/128
● Mobile/Embedded OpenCL announcements
  ● http://www.imgtec.com/News/Release/index.asp?NewsID=557
  ● http://www.ziilabs.com/technology/opencl.aspx

References

● http://blog.accelereyes.com/blog/2010/05/10/nvidia-fermi-cuda-and-opencl/

● http://www.s3graphics.com/en/news/news_detail.aspx?id=44

● http://www.gremedy.com/gDEBuggerCL.php

● http://browndeertechnology.com/stdcl.html

● http://golem5.org/gatlas/

● http://www.mainconcept.com/products/sdks/hw-acceleration/opencl-h264avc.html

● http://awaregeek.com/news/some-pictures-of-old-computers/