Introduction to OpenCL

Ezio Bartocci, Vienna University of Technology

Overview
• Overview of OpenCL for NVIDIA GPUs
• API and languages
• Sample code walkthrough
• OpenCL information and resources

OpenCL: Open Computing Language
• OpenCL is an open, royalty-free standard based on an extension of the C language
• It is a framework designed for parallel programming of heterogeneous systems using GPUs, CPUs, FPGAs, DSPs and other processors, including embedded mobile devices
• It was initially developed by Apple and is now supported by NVIDIA, Intel, AMD, IBM and other members of the OpenCL working group
• It is managed by the Khronos Group

OpenCL versions and history (1)
OpenCL 1.0 (2008)
• The specification dates from 2008; the first implementation shipped with Mac OS X Snow Leopard
OpenCL 1.1 (2010)
• The Khronos Group added significant functionality for enhanced parallel programming flexibility, functionality, and performance, including:
  • New data types, including 3-component vectors and additional image formats
  • Handling commands from multiple host threads and processing buffers across multiple devices
  • Operations on regions of a buffer, including read, write and copy of 1D, 2D, or 3D rectangular regions
  • Enhanced use of events to drive and control command execution
  • Additional OpenCL built-in C functions such as integer clamp, shuffle, and asynchronous strided copies
  • Improved OpenGL interoperability through efficient sharing of images and buffers by linking OpenCL and OpenGL events

OpenCL versions and history (2)
OpenCL 1.2 (2011)
• Most notable features include:
  • Device partitioning: the ability to partition a device into sub-devices so that work assignments can be allocated to individual compute units. This is useful for reserving areas of the device to reduce latency for time-critical tasks.
  • Separate compilation and linking of objects: the functionality to compile OpenCL into external libraries for inclusion into other programs.
  • Enhanced image support: OpenCL 1.2 adds support for 1D images and 1D/2D image arrays. Furthermore, the OpenGL sharing extensions now allow OpenGL 1D textures and 1D/2D texture arrays to be used to create OpenCL images.
  • Built-in kernels: custom devices that contain specific unique functionality are now integrated more closely into the OpenCL framework. Kernels can be called to use specialised or non-programmable aspects of the underlying hardware. Examples include video encoding/decoding and digital signal processors.
  • DirectX functionality: DX9 media surface sharing allows efficient sharing between OpenCL and DX9 or DXVA media surfaces. Equally, for DX11, seamless sharing between OpenCL and DX11 surfaces is enabled.

NVIDIA OpenCL Support
Operating systems
• Windows (XP, Vista, 8), 32/64-bit
• Linux (Ubuntu, RHEL, etc.), 32/64-bit
• Mac OS X Snow Leopard
Supported development environments
• GCC for Linux
• Visual Studio for Windows
Drivers and JIT compiler
• They are usually provided with the GPU drivers (e.g. the CUDA drivers)
NVIDIA SDK
• It contains example applications, the specification, the programming manual and the best practices guide

OpenCL Language & API
Platform Layer API (called from the host)
• An abstraction layer for diverse computational resources
• Query, select and initialize compute devices
• Create compute contexts and work-queues
Runtime API (called from the host)
• Launch compute kernels
• Set the kernel execution configuration
• Manage scheduling, compute, and memory resources
OpenCL Language
• Write compute kernels that run on a compute device
• C-based cross-platform programming interface
• Subset of ISO C99 with language extensions
• Includes a rich set of built-in functions
• Can be compiled Just In Time (JIT) or offline

OpenCL Programming Model
NDRange (N-Dimensional Range)
• N can be 1, 2 or 3
• It defines the global index space over which kernel instances execute
Work-item
• A single kernel instance in the index space
• Each work-item executes the same compute kernel, but on different data
• Work-items have unique global IDs from the index space
• A work-item can be related to the concept of a thread in CUDA
Work-group
• Work-items are further grouped into work-groups
• Each work-group has a unique work-group ID
• Each work-item has a unique local ID within its work-group
• A work-group can be related to the concept of a block of threads in CUDA

OpenCL Memory Model
[Figure: a compute device (e.g. a GPU) containing compute units 1 to N, each hosting a work-group of work-items 1 to M; every work-item has its own private memory and every compute unit its own local memory, above a global/constant memory data cache and the global memory held in compute device memory.]
• Private memory: read/write access, for the work-item only
• Local memory: read/write access, for the entire work-group
• Constant memory: read access, for the entire ND-range (all work-items, all work-groups)
• Global memory: read/write access, for the entire ND-range (all work-items, all work-groups)

Basic Program Structure
Host program
• Platform layer
  • Query compute devices
  • Create contexts
• Runtime
  • Create memory objects associated to contexts
  • Compile and create kernel program objects
  • Issue commands to the command-queue
  • Synchronize commands
  • Clean up OpenCL resources
Compute kernel (runs on device)
• OpenCL language: C code with some restrictions and extensions

Buffer objects
• 1D collection of objects (like C arrays)
• Scalar and vector types, and user-defined structures
• They are accessed via pointers in the compute kernel
Image objects
• 2D or 3D textures, frame-buffers, or images
• Must be addressed through built-in functions
Sampler objects
• Describe how to sample an image in the kernel
• Addressing modes
• Filtering modes

OpenCL Language Highlights
Function qualifiers
• The “__kernel” qualifier declares a function as a kernel
Address space qualifiers
• “__global”, “__local”, “__constant”, “__private”
Work-item functions
• get_work_dim()
• get_global_id(), get_local_id(), get_group_id(), get_local_size()
Image functions
• Images must be accessed through built-in functions
• Reads are performed through sampler objects, created on the host or defined in the kernel source
Synchronization functions
• Barriers: all work-items within a work-group must execute the barrier function before any work-item in the work-group can continue.
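
As a rough illustration of how the work-item functions listed above relate to one another, the following plain C sketch models a one-dimensional NDRange on the host. It is not OpenCL code: the type `ndrange_1d` and the `*_of` helper names are hypothetical and exist only for this example. The sketch assumes a zero global offset and a global size that is a multiple of the work-group size.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical host-side model of a 1-D NDRange; not part of the OpenCL API. */
typedef struct {
    size_t global_size; /* total number of work-items in the index space */
    size_t local_size;  /* number of work-items per work-group */
} ndrange_1d;

/* Number of work-groups in the range (what get_num_groups(0) would report). */
static inline size_t num_groups_of(ndrange_1d r)
{
    assert(r.local_size > 0 && r.global_size % r.local_size == 0);
    return r.global_size / r.local_size;
}

/* Work-group ID of the work-item with the given global ID
   (what get_group_id(0) would report inside a kernel). */
static inline size_t group_id_of(ndrange_1d r, size_t global_id)
{
    assert(global_id < r.global_size);
    return global_id / r.local_size;
}

/* Local ID of the work-item within its work-group
   (what get_local_id(0) would report inside a kernel). */
static inline size_t local_id_of(ndrange_1d r, size_t global_id)
{
    assert(global_id < r.global_size);
    return global_id % r.local_size;
}

/* Rebuild the global ID from a work-group ID and a local ID:
   global = group * local_size + local. */
static inline size_t global_id_of(ndrange_1d r, size_t group, size_t local)
{
    assert(local < r.local_size);
    return group * r.local_size + local;
}
```

For instance, with 1024 work-items split into work-groups of 64, the work-item with global ID 133 falls in work-group 2 with local ID 5, since 2 * 64 + 5 = 133. Inside a real kernel these values would come directly from get_group_id(0) and get_local_id(0) rather than being computed by hand.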