OpenCL Basics


Wolfram Schenck
Faculty of Eng. and Math., Bielefeld University of Applied Sciences
Vectorisation and Portable Programming using OpenCL, 21.-22.11.2017
Member of the Helmholtz-Association

Overview of the Lecture

1 OpenCL Overview
2 OpenCL Host API
3 OpenCL for Compute Kernels
4 Exercise 1
5 Event Handling
6 Exercise 2
7 Appendix: Notes on Nomenclature

OpenCL Overview

OpenCL

OpenCL (Open Computing Language)
• Programming framework for CPUs, GPUs, DSPs, FPGAs with programming language "OpenCL C"
• Open and royalty-free standard
• Started by Apple, subsequent development with AMD, IBM, Intel, and NVIDIA, meanwhile managed by Khronos Group
• Dec. 2008: OpenCL 1.0
• June 2010: OpenCL 1.1
• Nov. 2011: OpenCL 1.2
• Nov. 2013: OpenCL 2.0
• Nov. 2015: OpenCL 2.1

Goal: Programming framework for portable, parallel programming of devices in heterogeneous environments (CPUs, GPUs, and other processors; from smartphone to supercomputer)

CUDA C vs. OpenCL

CUDA C
PRO
• Mature and efficient
• Many tools and extra libraries
CONTRA
• Only usable for GPUs by NVIDIA

OpenCL
PRO
• For various processor types (independent from manufacturer)
• Supports heterogeneous platforms
CONTRA
• Not as mature and as widely used as CUDA C
• Partly long-winded programming necessary

Additional Information

• Khronos website: http://www.khronos.org/opencl (News, specifications, MAN pages . . . )
• AMD OpenCL Zone: http://developer.amd.com/tools-and-sdks/opencl-zone/
• Intel Developer Zone on OpenCL: https://software.intel.com/de-de/forums/opencl
• EBook on OpenCL: http://www.fixstars.com/en/opencl/book/OpenCLProgrammingBook/contents/
• OpenCL Programming Guide. Aaftab Munshi, Benedict Gaster, Timothy G. Mattson, James Fung, Dan Ginsburg. Addison-Wesley, 2011, ISBN: 978-0321749642
• . . .

Tutorials

• OpenCL - Introduction for HPC Programmers (from V. Kartoshkin and T. Mattson): https://software.intel.com/en-us/articles/tutorial-opencl-introduction-for-hpc-programmers
• OpenCL-Tutorials on PRACE website: http://www.training.prace-ri.eu/training_material/?tx_pracetmo_pi1[tag]=opencl
  - Heterogeneous Programming. OpenCL and High Level Programming Tools
  - PATC Course: Programming paradigms for new architectures
  - hybrid CUDA and OpenCL
• . . .

Architecture of OpenCL

At the conceptual level:
• Platform model
• Execution model
• Memory model
• Programming model

At the programming level:
• OpenCL Platform API
• OpenCL Runtime API
• OpenCL C (programming language)

Platform Model

Fig.: Wikipedia

Platform Model (cont.)

• Basic structure: Host which is connected to several devices
• Host: Computational unit on which the host program runs
  - Usually: CPU of the computer system
• Device: Computational unit which is accessed via OpenCL library
  - Examples: CPUs, GPUs, DSPs, FPGAs
• Further subdivision:
  - Device → "Compute Units"
  - Compute Unit → "Processing Elements"

Fig.: Wikipedia

Platform Model (CPU)

CPU
• Device: All CPUs on the mainboard of the computer system
• Compute unit (CU): One CU per core (or per hardware thread)
• Processing element (PE): 1 PE per CU, or if PEs are mapped to SIMD lanes, n PEs per CU, where n matches the SIMD width

Fermi: GF100/GF110

Die shot
16 Streaming Multi-Processors (SM)

Fig.: NVIDIA

GF100/GF110: Streaming Multi-Processor (SM)

SM properties
• 32 CUDA cores (Streaming processors/SP)
• 16 Load/store units
• 4 Special function units (SFU)
• 2 Warp schedulers

Overall (16 SMs × 32 CUDA cores): 512 ALUs/FPUs available

Fig.: NVIDIA

Platform Model (GPU and MIC)

GPU
• Device: Each GPU in the system acts as single device
• Compute unit (CU): One CU per multi-processor (NVIDIA)
• Processing element (PE): 1 PE per CUDA core (NVIDIA) or "SIMD lane" (AMD)

MIC
• Device: Each MIC in the system acts as single device
• Compute unit (CU): One CU per hardware thread (= 4 × [# of cores − 1])
• Processing element (PE): 1 PE per CU, or if PEs are mapped to SIMD lanes, n PEs per CU, where n matches the SIMD width

Platform Model (cont.)

Platform
• Every OpenCL implementation (with underlying OpenCL library) defines a so-called "platform".
• Each specific platform enables the host to control the devices belonging to it.
• Platforms of various manufacturers can coexist on one host and may be used from within a single application (ICD: "installable client driver model").

Fig.: Wikipedia

Platform Model: Practical Hints

Get OpenCL running under Linux
• Header files: Get from Khronos website (e.g.)
  - Central file: CL/cl.h
• OpenCL library stub with ICD loader: Get from one of the vendors of your OpenCL devices
  - Central file: libOpenCL.so
• ICD definition files and platform-specific OpenCL libraries: Get from all the vendors of your OpenCL devices
  - ICD files usually located in: /etc/OpenCL/vendors/
• Mechanism at runtime:
  - libOpenCL.so is dynamically linked to your application at runtime
  - ICD loader uses dlopen(..) to open all required platform-specific OpenCL libraries
  - Calls to OpenCL library functions are routed to the correct implementation
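
To check which platforms the ICD loader is likely to find, one can list the ICD definition files in the vendor directory named above. The following is a minimal sketch, assuming the usual convention that each .icd file holds the name or path of one vendor library on its first line. The function names is_icd_file and list_icd_files are made up for this illustration and are not part of OpenCL.

```c
#include <assert.h>
#include <dirent.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if the file name ends in ".icd" (and has a non-empty stem), else 0. */
int is_icd_file(const char *name)
{
    assert(name != NULL);
    const char *suffix = ".icd";
    size_t n = strlen(name), s = strlen(suffix);
    return n > s && strcmp(name + n - s, suffix) == 0;
}

/* Prints every *.icd file in `dir` together with its first line
 * (assumed to name the vendor library that the loader opens).
 * Returns the number of ICD files found, or -1 if `dir` cannot be opened. */
int list_icd_files(const char *dir)
{
    assert(dir != NULL);
    DIR *d = opendir(dir);
    if (d == NULL)
        return -1;

    int count = 0;
    struct dirent *entry;
    while ((entry = readdir(d)) != NULL) {
        if (!is_icd_file(entry->d_name))
            continue;

        char path[4096];
        snprintf(path, sizeof path, "%s/%s", dir, entry->d_name);

        char line[1024] = "";
        FILE *f = fopen(path, "r");
        if (f != NULL) {
            if (fgets(line, sizeof line, f) != NULL)
                line[strcspn(line, "\r\n")] = '\0';  /* strip the line break */
            fclose(f);
        }
        printf("%s -> %s\n", entry->d_name, line);
        ++count;
    }
    closedir(d);
    return count;
}
```

A call such as list_icd_files("/etc/OpenCL/vendors") returns -1 if the directory does not exist, which is a first hint that no ICD definition files are installed.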

Execution Model

Example: 2D-Arrangement of Work-Items

w_x, w_y: Indices of a work-group within the NDRange
S_x, S_y: Number of work-items within a single work-group
s_x, s_y: Indices of a work-item within a work-group
F_x, F_y: Global index offset

Fig.: "The OpenCL Specification 1.1", Fig. 3.2

Excursus: Thread Management on GPUs

Kernel: Function for execution on the device (here: GPU)
• Typical scenario: Many kernel instantiations running simultaneously in parallel threads

Challenge: Management of many thousands of threads

Solution: "Coarse Grained Parallelism" and "Fine Grained Parallelism"

Thread Management (cont.)

Hierarchical thread organization
• Upper level: Grid (equiv. to NDRange) ⇒ Device
• Medium level: Block (equiv. to work-group) ⇒ Streaming Multi-Processor (SM)
• Lower level: Thread (equiv. to work-item) ⇒ Streaming Processor (SP)

Fig.: NVIDIA

Thread Management (cont.)

Block Scheduler: "Coarse Grained Parallelism" (NVIDIA)
• Distributes groups of work-items ("work-groups") to SMs
• Takes free capacity into account (registers, local memory, number of work-items)
• Goal: Load-balancing ("round-robin" procedure)

Warp/Wavefront: "Fine Grained Parallelism"
• Warp (NVIDIA): Group of 32 work-items which are scheduled and executed together (within a work-group/SM)
• Wavefront (AMD): Group of 64 work-items which are scheduled and executed together (within a work-group/CU)
• At this level: SIMD

Latency Hiding (figure slide)
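
Returning to the warp and wavefront granularity described above: because work-items are scheduled in groups of 32 (NVIDIA) or 64 (AMD), a work-group whose size is not a multiple of that width leaves part of its last warp or wavefront unused. The sketch below works through this arithmetic; the function and constant names are illustrative only.

```c
#include <assert.h>
#include <stddef.h>

/* Scheduling widths quoted on the slides. */
enum { WARP_SIZE_NVIDIA = 32, WAVEFRONT_SIZE_AMD = 64 };

/* Number of warps/wavefronts needed for one work-group (rounded up). */
size_t simd_groups_per_work_group(size_t work_group_size, size_t simd_width)
{
    assert(simd_width > 0);
    return (work_group_size + simd_width - 1) / simd_width;
}

/* Lanes left without a work-item in the last, partially filled warp/wavefront. */
size_t idle_lanes(size_t work_group_size, size_t simd_width)
{
    size_t groups = simd_groups_per_work_group(work_group_size, simd_width);
    return groups * simd_width - work_group_size;
}
```

For example, a work-group of 100 work-items needs 4 warps on NVIDIA hardware with 28 idle lanes, whereas a work-group of 256 work-items fills exactly 8 warps.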

Thread Management (cont.) (numbers for GF110)

• Up to 1024 work-items per work-group
• Up to 8 work-groups actively scheduled per SM (does not result in 8 × 1024 work-items: see next item)
• Up to 1536 work-items per SM (organized as 48 warps)
  - With 16 SMs: Max. 16 × 1536 = 24576 work-items simultaneously scheduled
  - Comparison with CPU: Several 100 threads simultaneously active
• In the grid: Up to 65535^3 × 1024 ≈ 2.88 · 10^17 work-items per kernel call

NVIDIA Tesla Graphics Cards in Comparison

                                 GT200       GF110       GK110       GK110B
                                 (C1060)     (M2090)     (K20X)      (K40)
# of multi-procs. (SM/SMX)       30          16          14          15
# of CUDA cores (per SM/SMX)     8           32          192         192
# of CUDA cores (overall)        240         512         2688        2880
Clock (core/shader) [MHz]        602/1296    650/1300    735/735     745/875
GFLOPs (SP)                      933         1331        3951        4290
GFLOPs (DP)                      78          665         1317        1430
Memory bandwidth [GB/sec]        102         177         250         288
# of registers (per SM/SMX)      16384       32768       65536       65536
Shared mem. (per SM/SMX) [KB]    32          16–48       16–48       16–48
L1-cache (per SM/SMX) [KB]       0           16–48       16–48       16–48
L2-cache [KB]                    0           768         1536        1536
Max threads per SM/SMX           1024        1536        2048        2048
Max blocks per SM/SMX            8           8           16          16
Max threads in flight            30720       24576       28672       30720

("Max threads in flight" is the product of the number of SM/SMX and the max threads per SM/SMX, e.g. 14 × 2048 = 28672 for the K20X.)

Execution Model: Components

• Basic distinction:
  - Host: Executes host program
  - Device: Executes device kernel
• Hierarchy on device: NDRange → Work-Group → Work-Item
• Host defines: Context
  - Devices (only from single platform!)
  - Kernels (OpenCL functions for execution on the device)
  - Program objects (kernel source code and kernels in compiled form)
  - Memory objects
• Host manages: Queues
  - Kernel execution
  - Operations on memory objects
  - Synchronization
  - Variants: In-order and out-of-order execution

Memory Model

Fig.: Wikipedia

Memory Model: Allocation and Access

Table: "The OpenCL Specification 1.1", Tab. 3.1

Weak Consistency Model
• Consistency within work-group for global and local memory: Only at synchronization points within work-group
• Consistency between work-groups for global memory: Only at synchronization points at host level

Programming Model: Supported Approaches

Data Parallel
• Possible mappings between data and NDRange:
  - Strict 1:1 mapping: For each data element one work-item
  - More flexible mappings also possible
• Favored device class: GPUs

Task Parallel
• Execution of only a single kernel instance (equivalent to an NDRange with only one work-item)
• Parallelism via:
  - Multiple tasks in queue which are executed asynchronously
  - SIMD units on the device (using OpenCL vector data types)
• Favored device class: Multi-core CPUs, multi-CPU systems

The "Big Picture"

Fig.: Khronos Group

OpenCL: Basic Programming Steps in Host Code

1 Determine components of the heterogeneous system
2 Query specific properties of each component to adapt program execution dynamically during runtime
3 Compile and configure the OpenCL kernels
4 Create and initialize memory objects (buffers, images)
5 Execution of the kernels in the correct order with the best suited device for each kernel
6 Collection of results

Functions for all these steps: OpenCL Platform and Runtime API
Programming language for kernel code: OpenCL C

OpenCL Host API

Basic Programming Steps . . . in Practice

• Query platforms → selection . . .
• Query devices of the platform → selection . . .
• Create context for the devices
• Create queue (for context and device)
• Create program object (for context) ← from C string
  1 Compile program
  2 Create kernel (contained in program)
• Create memory objects (within context)
• Kernel execution:
  - Set kernel arguments
  - Put kernel into queue → Execution
• Copy memory objects with results from device to host (invoke via queue)
• Clean up

Query Platforms

Query all OpenCL platforms on the system

    cl_int clGetPlatformIDs ( cl_uint num_entries,
                              cl_platform_id * platforms,
                              cl_uint * num_platforms );

Return value: Error code (ideally equal to CL_SUCCESS)
num_entries: Number of pre-allocated elements of type cl_platform_id in the array platforms
platforms: Returns information about the platforms (for each platform one element in the array)
num_platforms: Returns number of platforms

Query Platforms (cont.)

Double invocation of clGetPlatformIDs(..) necessary
• 1. invocation: num_entries = 0, platforms = NULL → Query num_platforms
• Allocate num_platforms elements of type cl_platform_id in the array platforms
• 2. invocation: num_entries = num_platforms → Query platforms

Related functions
• clGetPlatformInfo(..)

Query Devices

Precondition: Platform exists

Query the devices belonging to the respective platform

    cl_int clGetDeviceIDs ( cl_platform_id platform,
                            cl_device_type device_type,
                            cl_uint num_entries,
                            cl_device_id * devices,
                            cl_uint * num_devices );

Return value: Error code (ideally equal to CL_SUCCESS)
platform: Selected platform
device_type: Device category (e.g. CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU)
num_entries: Number of pre-allocated elements of type cl_device_id in the array devices
devices: Returns information about the devices (for each device one element in the array)
num_devices: Returns number of devices

Query Devices (cont.)

Double invocation of clGetDeviceIDs(..) necessary
• 1. invocation: num_entries = 0, devices = NULL → Query num_devices
• Allocate num_devices elements of type cl_device_id in the array devices
• 2. invocation: num_entries = num_devices → Query devices

Related functions
• clGetDeviceInfo(..)

Create Context

Precondition: Device exists

Creation of a context

    cl_context clCreateContext ( const cl_context_properties * properties,
                                 cl_uint num_devices,
                                 const cl_device_id * devices,
                                 void (CL_CALLBACK * pfn_notify)( const char * errinfo,
                                                                  const void * private_info,
                                                                  size_t cb,
                                                                  void * user_data ),
                                 void * user_data,
                                 cl_int * errcode_ret );

Return value: The created context
properties: Zero-terminated list of property names and values defining the desired properties of the context (may be NULL)
num_devices: Number of devices for which the context shall be created
devices: Array with the devices for which the context shall be created
errcode_ret: Returns the error code (ideally equal to CL_SUCCESS)

For more details: see OpenCL man pages

Create Queue

Precondition: Context and device exist

Creation of a queue

    cl_command_queue clCreateCommandQueue ( cl_context context,
                                            cl_device_id device,
                                            cl_command_queue_properties properties,
                                            cl_int * errcode_ret );

Return value: The created queue
context: Context within which the queue shall be created
device: Device for which the queue shall be created
properties: Bit field for the definition of the desired properties of the queue
errcode_ret: Returns the error code (ideally equal to CL_SUCCESS)

Hint: The default mode for queues is "in order execution" (other settings possible via parameter properties).

Create Program Object

Precondition: Context and source code exist

Creation of a program object

    cl_program clCreateProgramWithSource ( cl_context context,
                                           cl_uint count,
                                           const char ** strings,
                                           const size_t * lengths,
                                           cl_int * errcode_ret );

Return value: The created program object
context: Context within which the program object shall be created
count: Number of char buffers with source code (see strings)
strings: Array with pointers to the char buffers containing the source code
lengths: Array specifying the length of each char buffer (in bytes)
errcode_ret: Returns the error code (ideally equal to CL_SUCCESS)

Hint: Most often the char buffers have been read in before from text files with the OpenCL source code (file extension: .cl). For small applications, there is often only one char buffer which contains the complete source code.

Compile Program

Precondition: Program object and device(s) exist

Compile the program for the listed devices

    cl_int clBuildProgram ( cl_program program,
                            cl_uint num_devices,
                            const cl_device_id * device_list,
                            const char * options,
                            void (CL_CALLBACK * pfn_notify)( cl_program program,
                                                             void * user_data ),
                            void * user_data );

Return value: Error code (ideally equal to CL_SUCCESS)
num_devices: Number of devices for which the program shall be compiled
device_list: Array with devices for which the program shall be compiled (these must belong to the same context as the program object!)
options: Char string with compiler options

Compile Program (cont.)

Hints
• Automagically, the right compiler implementation is used: the one from the OpenCL library which implements the platform for the devices which belong to the context of the program object.
• For more details: see OpenCL man pages

Related functions
• clGetProgramBuildInfo(..): Query the build status and the compiler logs (with error messages, e.g. for syntax errors within the OpenCL device source code)

Create Kernel

Precondition: Program object with compiled code exists

Creation of a compute kernel

    cl_kernel clCreateKernel ( cl_program program,
                               const char * kernel_name,
                               cl_int * errcode_ret );

Return value: The created kernel
program: The program object which contains the compiled kernel code
kernel_name: Name of the kernel function (within the source code of the program object)
errcode_ret: Returns the error code (ideally equal to CL_SUCCESS)

Hint: The kernel is afterwards available for all devices which were contained in the device_list when calling clBuildProgram(..) before.
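
The hint under "Create Program Object" mentions that the kernel source is usually read from a .cl text file into a char buffer before the program object is created. A minimal sketch of such a helper in plain C follows; the name read_kernel_source is hypothetical and the function is not part of the OpenCL API.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Reads the whole file `path` into a newly allocated, NUL-terminated buffer.
 * On success the length in bytes (without the terminator) is stored in
 * *length and the buffer is returned; the caller must free() it.
 * Returns NULL if the file cannot be opened or read completely. */
char *read_kernel_source(const char *path, size_t *length)
{
    assert(path != NULL && length != NULL);

    FILE *f = fopen(path, "rb");
    if (f == NULL)
        return NULL;

    char *source = NULL;
    if (fseek(f, 0, SEEK_END) == 0) {
        long size = ftell(f);
        if (size >= 0 && fseek(f, 0, SEEK_SET) == 0) {
            source = malloc((size_t)size + 1);
            if (source != NULL) {
                size_t got = fread(source, 1, (size_t)size, f);
                if (got == (size_t)size) {
                    source[got] = '\0';
                    *length = got;
                } else {          /* short read: give up and report failure */
                    free(source);
                    source = NULL;
                }
            }
        }
    }
    fclose(f);
    return source;
}
```

With a single buffer, the returned pointer and length would serve as the only elements of strings and lengths, with count = 1, in the call to clCreateProgramWithSource(..).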

Create Memory Objects

Precondition: Context exists

Here: Creation of a buffer (alternatively: sub buffer, image)

Creation of a buffer object

    cl_mem clCreateBuffer ( cl_context context,
                            cl_mem_flags flags,
                            size_t size,
                            void * host_ptr,
                            cl_int * errcode_ret );

Return value: The created buffer
context: Context within which the buffer shall be created
flags: Bit field for the definition of the buffer properties and of the copy operations executed at creation
size: Length of the buffer (in bytes)
host_ptr: Pointer to the memory area in host memory which is used as source for copy operations or which is directly used for the buffer
errcode_ret: Returns the error code (ideally equal to CL_SUCCESS)

Explanation of the Parameter flags (Disjunction within the Bit Field)

Flag                     Meaning
CL_MEM_READ_WRITE        Memory object will be read and written by a kernel.
CL_MEM_READ_ONLY         Memory object will only be read by a kernel.
CL_MEM_WRITE_ONLY        Memory object will only be written by a kernel.
CL_MEM_USE_HOST_PTR      The buffer shall be located in host memory at address host_ptr (content may be cached in device memory). Not combinable with CL_MEM_ALLOC_HOST_PTR or CL_MEM_COPY_HOST_PTR.
CL_MEM_ALLOC_HOST_PTR    The buffer will be newly allocated in host memory (note: in some implementations page-locked memory!).
CL_MEM_COPY_HOST_PTR     The buffer will be initialized with the content of the memory region to which host_ptr points.

Set Kernel Arguments

Precondition: Kernel exists

Set a single kernel argument

    cl_int clSetKernelArg ( cl_kernel kernel,
                            cl_uint arg_index,
                            size_t arg_size,
                            const void * arg_value );

Return value: Error code (ideally equal to CL_SUCCESS)
kernel: The kernel for which the argument is set
arg_index: Index of the argument (starting with 0 for the first argument of the kernel function)
arg_size: Length of the value of the argument (in bytes)
arg_value: Pointer to the value of the argument

Hints
• If you want to pass a global memory buffer as kernel argument, you have to use the corresponding cl_mem object as value.
• In this case, arg_size has to be the size of the cl_mem object (not the length of the buffer)!
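
Returning to the flags bit field of clCreateBuffer(..): the individual flags are combined with a bitwise OR, and not every combination is meaningful. The sketch below checks two rules in plain C: the exclusion of CL_MEM_USE_HOST_PTR stated in the table, and the rule from the OpenCL specification that at most one of the three access flags may be set. The MEM_* constants and the function names are stand-ins invented for this illustration so that the code compiles without OpenCL headers; real host code uses the CL_MEM_* constants from CL/cl.h.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-ins for the cl_mem_flags bits (illustration only). */
typedef unsigned long mem_flags;
enum {
    MEM_READ_WRITE     = 1 << 0,
    MEM_WRITE_ONLY     = 1 << 1,
    MEM_READ_ONLY      = 1 << 2,
    MEM_USE_HOST_PTR   = 1 << 3,
    MEM_ALLOC_HOST_PTR = 1 << 4,
    MEM_COPY_HOST_PTR  = 1 << 5
};

/* Counts the bits that are set in `value`. */
static int count_bits(mem_flags value)
{
    int count = 0;
    for (; value != 0; value &= value - 1)  /* clears the lowest set bit */
        ++count;
    return count;
}

/* Returns true if the flag combination passes both checks. */
bool mem_flags_combinable(mem_flags flags)
{
    const mem_flags access = MEM_READ_WRITE | MEM_WRITE_ONLY | MEM_READ_ONLY;

    if (count_bits(flags & access) > 1)
        return false;  /* more than one access mode requested */
    if ((flags & MEM_USE_HOST_PTR) &&
        (flags & (MEM_ALLOC_HOST_PTR | MEM_COPY_HOST_PTR)))
        return false;  /* USE_HOST_PTR excludes ALLOC_HOST_PTR and COPY_HOST_PTR */
    return true;
}

/* Usage example: a typical input buffer and a rejected combination. */
void mem_flags_examples(void)
{
    assert(mem_flags_combinable(MEM_READ_ONLY | MEM_COPY_HOST_PTR));
    assert(!mem_flags_combinable(MEM_USE_HOST_PTR | MEM_COPY_HOST_PTR));
}
```

The check covers only these two rules; an actual OpenCL implementation reports invalid combinations through the error code of clCreateBuffer(..).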

Execution Model (repeated)

Example: 2D-Arrangement of Work-Items

w_x, w_y: Indices of a work-group within the NDRange
S_x, S_y: Number of work-items within a single work-group
s_x, s_y: Indices of a work-item within a work-group
F_x, F_y: Global index offset

Fig.: "The OpenCL Specification 1.1", Fig. 3.2

Kernel Execution

Precondition: Queue and kernel exist, kernel arguments already set

Place a kernel for execution in a queue

    cl_int clEnqueueNDRangeKernel ( cl_command_queue command_queue,
                                    cl_kernel kernel,
                                    cl_uint work_dim,
                                    const size_t * global_work_offset,
                                    const size_t * global_work_size,
                                    const size_t * local_work_size,
                                    cl_uint num_events_in_wait_list,
                                    const cl_event * event_wait_list,
                                    cl_event * event );

Return value: Error code (ideally equal to CL_SUCCESS)
command_queue: Queue which shall be used for execution
kernel: The kernel to be executed
work_dim: Number of array dimensions (concerning the following three parameters)
global_work_offset: (F_x, F_y, F_z) (see preceding slide)
global_work_size: (G_x, G_y, G_z) (see preceding slide; overall number of work-items in each dimension across all work-groups!)
local_work_size: (S_x, S_y, S_z) (see preceding slide; the ratios G_x/S_x, G_y/S_y, G_z/S_z need to be integer numbers!)

Kernel Execution (cont.)

Hints
• The size and structure of the NDRange are defined by the parameters global_work_offset, global_work_size, and local_work_size.
• If local_work_size is set to NULL, the size of the work-groups will be automatically determined.
• For details on event handling: see later slides
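
Since the ratio G/S has to be an integer, a common approach for a problem size that is not a multiple of the work-group size is to round the global size up and let the surplus work-items do nothing. The sketch below shows this for the 1D case in plain C, together with the index relation g = w · S + s + F from the execution model. It emulates the NDRange sequentially on the host with a two-vector addition as body; it is not OpenCL code, and all function names are illustrative.

```c
#include <assert.h>
#include <stddef.h>

/* Smallest multiple of local_size that is >= n (1D global work size G). */
size_t padded_global_size(size_t n, size_t local_size)
{
    assert(local_size > 0);
    return ((n + local_size - 1) / local_size) * local_size;
}

/* Global index of a work-item: g = w * S + s + F. */
size_t global_index(size_t group_id, size_t local_size,
                    size_t local_id, size_t offset)
{
    assert(local_id < local_size);
    return group_id * local_size + local_id + offset;
}

/* Sequential emulation of a 1D NDRange computing c[i] = a[i] + b[i]. */
void emulate_vector_add(const float *a, const float *b, float *c,
                        size_t n, size_t local_size)
{
    size_t global_size = padded_global_size(n, local_size);
    size_t num_groups = global_size / local_size;   /* integer by construction */

    for (size_t w = 0; w < num_groups; ++w) {        /* work-groups */
        for (size_t s = 0; s < local_size; ++s) {    /* work-items in a group */
            size_t g = global_index(w, local_size, s, 0);
            if (g < n)                               /* guard for padding work-items */
                c[g] = a[g] + b[g];
        }
    }
}
```

For n = 1000 and a work-group size of 64 the global size becomes 1024, so the last 24 work-items fail the guard and leave memory untouched. In a real kernel the same guard would compare get_global_id(0) against a length passed as an additional kernel argument.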

Transfer Data from Device to Host

Precondition: Queue exists

Copy buffer content into host memory (e.g., buffer with results after kernel execution)

    cl_int clEnqueueReadBuffer ( cl_command_queue command_queue,
                                 cl_mem buffer,
                                 cl_bool blocking_read,
                                 size_t offset,
                                 size_t cb,
                                 void * ptr,
                                 cl_uint num_events_in_wait_list,
                                 const cl_event * event_wait_list,
                                 cl_event * event );

Return value: Error code (ideally equal to CL_SUCCESS)
command_queue: Queue which shall be used for execution
buffer: Buffer object which serves as source of the copy operation
blocking_read: If true, the function only returns after the copy operation has been finished (and therefore also all preceding commands in the queue if it operates in "in-order mode")
offset: Read offset in the buffer (in bytes)
cb: Number of bytes to copy
ptr: Pointer to the target region in host memory (needs to be allocated in sufficient size before)

For details on event handling: see later slides

Free OpenCL Resources (Selection)

Release of different types of OpenCL objects

    cl_int clReleaseContext ( cl_context context );
    cl_int clReleaseCommandQueue ( cl_command_queue command_queue );
    cl_int clReleaseProgram ( cl_program program );
    cl_int clReleaseKernel ( cl_kernel kernel );
    cl_int clReleaseMemObject ( cl_mem memobj );

Return value: Error code (ideally equal to CL_SUCCESS)

Hint: In analogy to the release functions, retain functions also exist for many types of OpenCL objects. The retain functions increase an object-internal counter, the release functions decrease it. Only after all retain calls were compensated by a release call will the next subsequent release call ultimately free the resources of the object.

OpenCL for Compute Kernels

Basic Facts about "OpenCL C"

• Derived from ISO C99
• A few restrictions: No recursion, no function pointers, no functions from the C99 standard headers
• Preprocessing directives defined by C99 are supported (e.g., #include)
• Built-in data types:
  - Scalar and vector data types, pointers, images
• Mandatory built-in functions:
  - Work-item functions, math.h, reading and writing of images
  - Relational functions, geometric functions, synchronization functions
  - printf (v1.2 only)
• Optional built-in functions (called "extensions"):
  - Support for double precision, atomics to global and local memory

Qualifiers and Functions

• Function qualifiers:
  - __kernel qualifier declares a function as a kernel, i.e. makes it visible to host code so that it can be enqueued
• Address space qualifiers:
  - __global, __local, __constant, __private
  - Pointer kernel arguments must be declared with an address space qualifier (excl. __private)
• Work-item functions:
  - get_work_dim(), get_global_id(), get_local_id(), get_group_id(), etc.
• Synchronization functions:
  - Barriers: all work-items within a work-group must execute the barrier function before any work-item can continue: barrier(cl_mem_fence_flags flags)
  - Memory fences: provide ordering between memory operations: mem_fence(cl_mem_fence_flags flags)

Restrictions

• Pointers to functions are not allowed
• Pointers to pointers allowed within a kernel, but not as an argument to a kernel invocation
• Bit-fields are not supported
• Variable length arrays are not supported
• Structures and other data types have to be defined in exactly the same way in both the host and device code (naturally, use common header files)
• Recursion is not supported
• Double types are optional in OpenCL v1.1, but the key word is reserved (note: Most implementations support double)

Exercise 1

Task and Hints

Task
• Implement the addition of three vectors instead of two!

Hints
• Copy project files from train025@zam1069:OpenCL_Course/OpenCL_Basics/example
• Use host code in VectorAddition.C as starting point
• Use device code in vectoradd.cl as starting point
• Adjust settings in Makefile to the computer system which you are using

Event Handling

Further Useful Functions . . . for Event Handling and Queues . . .

    cl_int clWaitForEvents ( cl_uint num_events,
                             const cl_event * event_list );

Wait for all events in event_list
Return value: Error code (ideally equal to CL_SUCCESS)
num_events: Number of elements in event_list
event_list: Array of events

    cl_int clFlush ( cl_command_queue command_queue );

Issues all previously queued OpenCL commands in command_queue to the device associated with command_queue
Return value: Error code (ideally equal to CL_SUCCESS)

    cl_int clFinish ( cl_command_queue command_queue );

Blocks until all previously queued OpenCL commands in command_queue are issued to the associated device and have completed. clFinish is also a synchronization point.
Return value: Error code (ideally equal to CL_SUCCESS)

Exercise 2

Task and Hints

Task
• Extend your code from exercise 1 . . .
• Implement a second kernel for element-wise vector multiplication!
• Compute with both kernels (multiplication and pair-wise addition) the equation e = a ∗ b + c ∗ d as element-wise vector operation!

Hints
• BONUS: Use an out-of-order queue instead of the default queue . . .
• . . . and ensure by using events that all commands are executed in the right order!

Appendix: Notes on Nomenclature

Nomenclature: AMD vs. NVIDIA

AMD                                     NVIDIA
-                                       Texture Processing Cluster (TPC), Graphics Processing Cluster (GPC)
SIMD-Core; GCN-Arch.: Compute-Unit      Streaming Multi-Processor (SM, SMX)
GCN-Arch.: SIMD                         -
SIMD unit ("SIMD lane")                 Streaming Proc. (SP), CUDA Core
Wavefront                               Warp
Local Data Share                        Shared Memory
Global Data Share                       -

Nomenclature: OpenCL vs. CUDA

OpenCL                    CUDA
Work-Item                 Thread
Work-Group                Block
NDRange (Workspace)       Grid
Local Memory              Shared Memory
Private Memory            Registers/Local Memory
Image                     Texture
Queue                     Stream
Event                     Event